Title: Pathfinding Toward a Self-Healing Architecture in Data Centers
Speaker: Arijit Biswas
Abstract:
Multi-process, chiplet-based System-on-Chip (SoC) data center architectures are dominating xPU designs, integrating various parallel-processing accelerators alongside more traditional execution pipelines. Due to the very high complexity of such SoCs, and the associated costs to manufacture, package, and assure their quality, new architectures will require novel RAS (reliability, availability, and serviceability) approaches that are cost-effective, adaptable, and reconfigurable for diverse target workloads such as AI/ML, HPC, general computing, low-latency processing, graphics, cloud and edge, communications, and embedded markets. Additionally, system-level integration of such functions presents both challenges and opportunities to optimize and target the solution space based on usages and workloads at the system, node, rack, or data center level, enabling better total cost of ownership and value customization.
Continuous improvement in the quality and reliability space spans manufacturing, test, and product development. From a technology perspective, this involves new technologies and methods for fault detection, system diagnostics, and error recovery. This level of reliability is a critical requirement to ensure success across the wide variety of uses of such products, which range from contractual requirements to life safety and even security. Consequently, a wide spectrum of reliability options will define solution spaces in which redundancy cost, time-to-market, and sustainability are the primary factors driving those choices and solutions.
High-reliability computing architectures have traditionally focused on individual aspects of reliability, availability, reconfigurability, diagnostics, prognostics, and related verification and validation processes/methodologies (simulation/formal) as ways to ensure continuity of computing services despite internal SoC errors or failures that affect some part of the SoC logic. Our methodology takes a different direction: we first establish the key pillars of a self-healing architecture, describe their attributes, and then follow with individual methods to support those pillars.
Our methodology proposes (run-time) detection, seamless diagnostics, and recovery/failover as the three key pillars of the self-healing architecture, all bound together with clear interfaces and configurability so that they operate as an automated virtuous cycle that detects, diagnoses, and recovers from faults in the field without customer intervention. Detection incorporates run-time detection capabilities as well as coverage and error containment. Diagnostics include the ability to run high-coverage test content, including stress content, either at boot time or seamlessly during run-time, triggered either by detection or by the user, at a granularity that matches the available recovery mechanisms. Finally, recovery incorporates a variety of mechanisms across a broad range of the data center system stack to provide the right level of recovery for high uptime and availability.
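To make the shape of this cycle concrete, the following is a minimal, hypothetical sketch in Python of the detect -> diagnose -> recover loop described above. All class, function, and field names here are illustrative assumptions for the sketch, not a product interface.

```python
# Hypothetical sketch of the self-healing "virtuous cycle": run-time
# detection, diagnostics at a recoverable granularity, then recovery.
# Names and structure are assumptions for illustration only.
from dataclasses import dataclass
from enum import Enum, auto


class Granularity(Enum):
    CORE = auto()
    NODE = auto()
    RACK = auto()


@dataclass
class Fault:
    location: str
    granularity: Granularity


class SelfHealingLoop:
    """Automates the detect -> diagnose -> recover cycle without user intervention."""

    def __init__(self, detect, diagnose, recover):
        self.detect = detect      # pillar 1: run-time detection with containment
        self.diagnose = diagnose  # pillar 2: high-coverage (stress) test content
        self.recover = recover    # pillar 3: recovery actions keyed by granularity

    def step(self) -> bool:
        fault = self.detect()
        if fault is None:
            return False                    # healthy: nothing to do
        confirmed = self.diagnose(fault)    # isolate at a recoverable granularity
        self.recover[confirmed.granularity](confirmed)
        return True


if __name__ == "__main__":
    # Toy run: one injected core fault, "recovered" by fencing off the core.
    pending = [Fault("core3.alu", Granularity.CORE)]
    loop = SelfHealingLoop(
        detect=lambda: pending.pop() if pending else None,
        diagnose=lambda f: f,  # assume diagnostics confirm the detected fault
        recover={Granularity.CORE: lambda f: print(f"fenced off {f.location}")},
    )
    while loop.step():
        pass
```

The key design point the sketch reflects is the coupling between pillars: diagnostics must resolve faults to a granularity for which a recovery action actually exists, otherwise the cycle cannot close automatically.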
Further, we determine that such an architecture must provide configurability, via a set of unified control interfaces that allow users to (seamlessly) adjust system RAS capabilities, and scalability, to allow substantial changes in support of partial or full RAS capabilities based on the user's needs, where the primary factors are the key usages, the size of the data center system (rack, module, blade), and its required protection granularity. It is important to note that a self-healing architecture requires support for multiple pillars, and those pillars can be linked.
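One way such a unified control interface could be expressed is as a declarative RAS policy selected per deployment. The sketch below is a hypothetical illustration of that idea; every field name and value is an assumption for the example, not drawn from any defined specification.

```python
# Hypothetical "unified control interface" for RAS configurability:
# a declarative policy choosing partial or full capabilities per pillar,
# scoped to the deployment's protection granularity. Illustrative only.
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    BLADE = "blade"
    MODULE = "module"
    RACK = "rack"


@dataclass(frozen=True)
class RasPolicy:
    detection: str      # e.g. "parity", "ecc", "lockstep"
    diagnostics: str    # e.g. "boot-time", "runtime-seamless"
    recovery: str       # e.g. "retry", "core-fencing", "node-failover"
    scope: Scope        # protection granularity for this deployment


# Example: a cloud deployment trading redundancy cost for high availability.
cloud_policy = RasPolicy(
    detection="ecc",
    diagnostics="runtime-seamless",
    recovery="node-failover",
    scope=Scope.RACK,
)
```

A policy object like this is one plausible way to let users scale RAS support up or down per usage and system size without changing the underlying detection, diagnostic, or recovery mechanisms themselves.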
Bio:
Dr. Zane A. Ball is a Corporate Vice President and General Manager of the Data Center and AI (DCAI) Product Management Group at Intel. DCAI Product Management is responsible for end-to-end stewardship of DCAI's systems, software, CPU, GPU, and custom product lines through the entirety of the product lifecycle. Prior to his product management role, Ball was CVP and GM of platform engineering and architecture for Intel's data center business. Ball has also served as Co-GM of Intel's foundry effort as a VP in the Technology and Manufacturing Group, and as a VP of the Client Computing Group, including roles as GM of the desktop client business and GM of global customer engineering.
Ball earned his bachelor's, master's, and Ph.D. degrees in electrical engineering from Rice University. He also holds six patents in high-speed electrical design.