Invited Talk Abstract: Rama Bhimanadhuni

Title: Enabling Generative AI: Exploring RAS Requirements for Hyperscale AI Infrastructure

Speaker: Rama Bhimanadhuni

Generative AI workloads have driven rapid growth in the number of GPUs and accelerators deployed in hyperscale cloud data centers. These workloads are evolving swiftly, increasing demand for hardware resources such as compute, memory, networking, and high-speed interconnects. At the scale of AI supercomputers, however, hardware failure rates are also rising across these resources, which calls for innovations in reliability, availability, and serviceability (RAS) technology spanning silicon, server, firmware, software, rack, and fleet. Drawing on insights from hyperscale AI infrastructure fleets, this technical talk will explain why RAS requirements matter for reducing job disruptions, ensuring hardware error resilience, improving maintenance and serviceability, and enabling root cause analysis and failure prediction. The session will also show how the industry has recently attempted to standardize RAS across GPUs and accelerators through the OCP Hardware Management workstream.

Keynote

Corporate Vice President, General Manager, Data Center and AI Product Management, Intel Corporation