Session – Hardware Fault Management
Speaker: Yogesh Varma
Title: Towards autonomous hardware fault management
Hardware fault management for the modern data-center fleet is a complex undertaking with immense potential to improve TCO through data-driven decisions that raise server serviceability and availability. Fault-handling considerations vary widely with the server platform's hardware RAS and telemetry capabilities, deployment type, and workload. The intractable nature of hyperscale fleet error, telemetry, and usage models naturally lends itself to machine-learning-driven hardware fault management and RAS actions. However, any data-driven action can only be as good as its input data. At the Open Compute Project (OCP) Hardware Fault Management (HWFM) project, Intel is partnering with key industry stakeholders to standardize a comprehensive framework for vendor-agnostic hardware fault logging. This framework will enable AI-assisted hardware fault analysis and autonomous RAS actions for contemporary data-center fault management.
This opening talk of the special session on Hardware Fault Management will set the stage by discussing the north star for such a data-driven hardware fault management framework. It will then discuss the key initiatives and contributions, and segue into the remainder of the session, which covers details of the in-band and out-of-band (OOB) hardware fault management framework, fleet memory fault management, the RAS API standard, and GPU RAS initiatives at the Open Compute Project.
Dr. Zane A. Ball is a Corporate Vice President and General Manager of the Data Center and AI (DCAI) Product Management Group. DCAI Product Management is responsible for end-to-end stewardship of DCAI’s systems, software, CPU, GPU, and custom product lines throughout the product lifecycle. Prior to his product management role, Ball was CVP and GM of platform engineering and architecture for Intel’s data center business. Ball has also served as Co-GM of Intel’s foundry effort as a VP in the Technology and Manufacturing group, and as VP of the Client Computing Group, including roles as GM of the desktop client business and GM of global customer engineering.
Ball has a bachelor’s degree, master’s degree, and Ph.D. in electrical engineering, all earned from Rice University. He also holds six patents in high-speed electrical design.