Title: In-band Error Handling requirements for RAS in hyperscale data centers
Speaker: Anil Agrawal
Abstract:
Serviceability is one of the three pillars of RAS. Hyperscale data centers require sophisticated serviceability features to minimize cost and to meet target metrics such as higher ‘availability’, lower ‘Interruption Rate’, and higher ‘efficiency’. This session will cover Meta’s RAS requirements for the hyperscale data center systems used to build AI/ML training clusters, with a specific emphasis on ‘in-band error handling’ requirements.
Dr. Zane A. Ball is a Corporate Vice President and General Manager of the Data Center and AI (DCAI) Product Management Group. DCAI Product Management is responsible for end-to-end stewardship of DCAI’s systems, SW, CPU, GPU, and custom product line throughout the product lifecycle. Prior to his product management role, Ball was CVP and GM of platform engineering and architecture for Intel’s data center business. Ball has also served as Co-GM of Intel’s foundry effort as a VP in the Technology and Manufacturing group, and as VP of the Client Computing Group, including roles as GM of the desktop client business and as GM of global customer engineering.
Ball has a bachelor’s degree, master’s degree, and Ph.D. in electrical engineering, all earned from Rice University. He also holds six patents in high-speed electrical design.