Title: Silent Data Errors: Causes, Implications to AI Workloads, and In-Field Mitigation
Speaker: David Lerner
Abstract: Silent Data Errors (SDEs) cause unpredictable system behavior and are a serious concern for at-scale compute in data center operations. This talk will review characterization results for the physical defect mechanisms that lead to SDE events and discuss their impact on Artificial Intelligence (AI) workloads. A summary of SDE mitigation tools and data from testing over 1 million processors will be shared, along with the implications for in-field mitigation of hard-to-detect defects that manifest as SDEs.
Dr. Zane A. Ball is a Corporate Vice President and General Manager of the Data Center and AI (DCAI) Product Management Group. DCAI Product Management is responsible for end-to-end stewardship of DCAI’s systems, software, CPU, GPU, and custom product lines throughout the product lifecycle. Prior to his product management role, Ball was CVP and GM of platform engineering and architecture for Intel’s data center business. Ball has also served as Co-GM of Intel’s foundry effort while a VP in the Technology and Manufacturing group, and as a VP of the Client Computing Group, where his roles included GM of the desktop client business and GM of global customer engineering.
Ball holds a bachelor’s degree, a master’s degree, and a Ph.D. in electrical engineering, all from Rice University. He also holds six patents in high-speed electrical design.