Title: Efforts to Address Fault Management Challenges in OCP
Speaker: Drew Walton
Abstract:
It is too hard for end users to understand and handle hardware errors correctly. We lack a first-principles understanding of the silicon design, we don’t know what many of the errors mean, and we don’t know what data must be captured to understand what has failed. Modern SoCs have RAS features that can mitigate hardware failures, but these features are too hard to use, and end users lack data, and therefore consensus, on which of these features are most useful.
This presentation will give an overview of key fault management efforts in OCP to address these challenges and provide a glimpse of what fault management will look like in future platforms. It will discuss how we are working across the industry to simplify the work needed to log errors, analyze the error logs, and take the appropriate action to mitigate the failure. It will cover the efforts of the overall OCP Hardware Fault Management Team, the Fleet Memory Fault Management Team, and the RAS API Team. It will show how these efforts fit together to meet the challenges end users face and present some initial thoughts on possible next efforts for the fault management community.
Dr. Zane A. Ball is a Corporate Vice President and General Manager of the Data Center and AI (DCAI) Product Management Group. DCAI Product Management is responsible for end-to-end stewardship of DCAI’s systems, software, CPU, GPU, and custom product line through the entirety of the product lifecycle. Prior to his product management role, Ball was CVP and GM of platform engineering and architecture for Intel’s data center business. Ball has also served as Co-GM of Intel’s foundry effort as a VP in the Technology and Manufacturing group, and as VP of the Client Computing Group, including roles as GM of the desktop client business and as GM of global customer engineering.
Ball has a bachelor’s degree, master’s degree, and Ph.D. in electrical engineering, all earned from Rice University. He also holds six patents in high-speed electrical design.