Title: Maximizing Availability for a Zettascale Datacenter
Speaker: Sanjay Gongalore
Abstract:
The rapid adoption and growing complexity of generative AI models are triggering a furious buildout of AI factories that are projected to reach Zettascale in 2025. Reliably training large language models (LLMs) for generative AI at scale is one of the toughest challenges in the datacenter today. The presentation will first establish terminology, then present self-healing approaches that let datacenters maintain high availability and efficiency despite in-field hardware failures. The talk will cover topics such as modeling availability, attributing faults so that interruption is minimal, and recovery.
Dr. Zane A. Ball is a Corporate Vice President and General Manager of the Data Center and AI (DCAI) Product Management Group. DCAI Product Management is responsible for end-to-end stewardship of DCAI’s systems, software, CPU, GPU, and custom product line throughout the product lifecycle. Prior to his product management role, Ball was CVP and GM of platform engineering and architecture for Intel’s data center business. Ball has also served as Co-GM of Intel’s foundry effort as a VP in the Technology and Manufacturing group, and as VP of the Client Computing Group, including roles as GM of the desktop client business and GM of global customer engineering.
Ball has a bachelor’s degree, master’s degree, and Ph.D. in electrical engineering, all earned from Rice University. He also holds six patents in high-speed electrical design.