Title: Enabling Generative AI: Exploring RAS Requirements for Hyperscale AI Infrastructure
Speaker: Rama Bhimanadhuni
Generative AI workloads have driven a rapid increase in the number of GPUs and accelerators deployed in cloud data centers at hyperscale. These workloads are evolving quickly, which increases demand for hardware resources such as compute, memory, networking, and high-speed interconnects. At the scale of AI supercomputers, however, hardware failure rates are also rising across these resources, requiring RAS (reliability, availability, and serviceability) innovations spanning silicon, server, firmware, software, rack, and fleet. Drawing on insights from hyperscale AI infrastructure fleets, this technical talk explains why RAS requirements matter for reducing job disruptions, ensuring hardware error resilience, improving maintenance and serviceability, and enabling root cause analysis and failure prediction. The session also covers recent industry efforts to standardize RAS across GPUs and accelerators through the OCP Hardware Management workstream.
Dr. Zane A. Ball is a Corporate Vice President and General Manager of the Data Center and AI (DCAI) Product Management Group. DCAI Product Management is responsible for end-to-end stewardship of DCAI’s systems, software, CPU, GPU, and custom product lines throughout the product lifecycle. Prior to his product management role, Ball was CVP and GM of platform engineering and architecture for Intel’s data center business. Ball has also served as Co-GM of Intel’s foundry effort as a VP in the Technology and Manufacturing Group, and as VP of the Client Computing Group, where his roles included GM of the desktop client business and GM of global customer engineering.
Ball has a bachelor’s degree, master’s degree, and Ph.D. in electrical engineering, all earned from Rice University. He also holds six patents in high-speed electrical design.