Timeslot | Topic | Organizer | Moderator | Speaker | Title | |
7:00 – 8:30 | Registration and Breakfast | |||||
8:30 – 9:00 | Opening Address | Jyotika Athavale, Yervant Zorian, Dimitris Gizopoulos, Amr Haggag (Chairs) | Welcome | |||
9:00 – 10:00 | 9:00-9:30 | Keynote 1 | Jyotika Athavale (IEEE Computer Society) | Ankur Garg (Microsoft) | Navigating the AI Era: The Essential Role of RAS | |
9:30-10:00 | Keynote 2 | Steve Hesley (AMD) | The Vital Role of RAS | |||
10:00 – 10:30 | Coffee Break | |||||
10:30 – 12:00 | Invited Special Session 1 (SS1 Quality) | Rama Govindaraju (Google) | Rama Govindaraju (Google) | Subhasish Mitra (Stanford); Vilas Sridharan (AMD); Harish Dixit (Meta); David Lerner (Intel) | SDC (ELF and best practices for in-situ screening): A Cambrian Explosion in Robust Computing Systems is Dead Ahead; Addressing emerging fault modes with testing and reliability; Silent Data Corruptions at Scale; Silent Data Errors: Causes, Implications to AI Workloads, and In-Field Mitigation | |
12:00 – 13:00 | Lunch | |||||
13:00 – 14:00 | 13:00-13:30 | Keynote 3 | Yervant Zorian (Synopsys) | George Tchaparian (OCP) | Challenges and OCP Community Progress towards Reliability, Availability, and Serviceability (RAS) with a Special Focus on Artificial Intelligence | |
13:30-14:00 | Keynote 4 | Zane Ball (Intel) | Pioneering Reliability, Availability, and Serviceability in the AI era | |||
14:00 – 15:30 | Invited Special Session 2 (SS2 Reliability) | Rama Bhimanadhuni (Microsoft) | Shawn Blanton (CMU) | Yogesh Varma (Intel); Rama Bhimanadhuni (Microsoft); Drew Walton; Dimitris Gizopoulos (U of Athens) | Hardware Fault Management: Towards Autonomous Hardware Fault Management; Enabling Generative AI - Exploring RAS Requirements for Hyperscale AI Infrastructure; Efforts to Address Fault Management Challenges in OCP; The Role of Abstraction and Modeling in the Assessment of Hardware Faults Effects | |
15:30 – 16:00 | Coffee Break | |||||
16:00 – 17:30 | Invited Special Session 3 (SS3 Availability) | Yogesh Varma (Intel) | Cecilia Metra (U of Bologna) | Swadesh Choudhary (Intel); Yervant Zorian (Synopsys); Arijit Biswas (Intel); Sanjay Gongalore (Nvidia) | Pathfinding Toward SoC Self-Healing Architecture: UCIe RAS Overview; RAS Challenges & Solution for Today’s Chiplet-based Systems; Pathfinding Toward SoC Self-Healing Architecture; Maximizing Availability for a Zettascale Datacenter | |
17:30 – 19:00 | Reception with IEEE CS BOG | |||||
19:00 – 20:30 | Invited Special Session 4 (SS4 Serviceability) | Drew Walton | Drew Walton | John Holm (Intel); Rob Chappell (Microsoft); Anil Agrawal (Meta); Amit Pandey (Amazon) | In-fleet Serviceability: Enhancing Computer Serviceability Through Error Telemetry; Challenges in Hyperscale Serviceability; In-band Error Handling requirements for RAS in hyperscale data centers; Addressing Serviceability throughout device lifecycle with High Speed Access for Test |
07:00 – 08:00 | Breakfast |
08:00 – 09:15 | Session 1 Data Center RAS 1 | Session 2 Memory and Interconnects 1 |
| 1.1 – Silent Data Corruption – Intel-Meta joint collaboration to detect and mitigate at-scale Shubhada Sahasrabudhe, Harish Dixit, David Lerner, Tejasvi Chakravarthy, Thiago Macieira, Matt Beadon, Sriram Sankar and Ethan Hansen (Intel and Meta) | 2.1 – AI in BMC: Improving DDR5 Memory Reliability in Hyperscale Data Centers Shen Zhou, Dahai Zhou, Gaoyu Ruan, Zhibing Li, Yi Li and Keke Xie (Intel and Alibaba) |
| 1.2 – Silent Data Corruption – Meta-AMD silent error collaboration for screening efficiency at-scale Tejasvi Chakravarthy, Sankarnarayanan Gurumurthy, Harish Dattatraya Dixit and Abishek Hariharan (Meta and AMD) | 2.2 – The Future is Now: Empowering DRAM ECC through a Forgotten Coding Theory Kelly Fitzpatrick, Saeed Raja, Yang Liu and Tong Zhang (ScaleFlux) |
| 1.3 – RAS Significance and Challenges in Hyper-Scalar Data Centers with Need for Industry Standardization Tulika Jha, Bob Krick, John Lee and Saurabh Agrawal (Microsoft) | 2.3 – Standardized RAS API using CXL Component Command Interface Shubhada Pugaonkar and Antonio Hasbun Marin (Intel) |
| 1.4 – Innovative Approaches to Solving Flash-Induced Latencies in Hyperscale Environments Vineet Parekh, Suman Gumudaveli and Venkat Ramesh (Meta) | 2.4 – Reducing Memory Errors On-the-fly with Prediction-Guided Failure Prevention Shen Zhou, Yu Zhang, Chenchen Li, Linlin Han and Feng Xu (Intel and ByteDance) |
| 1.5 – Open Compute Project’s Server Resilience Specification 1.0 Thiago Macieira (Intel) | 2.5 – PCIe Error Handling Challenges in building AI/ML systems in hyperscale datacenters Anil Agrawal and Bill Holland (Meta) |
09:15 – 10:30 | Session 3 Data Center RAS 2 | Session 4 Memory and Interconnects 2 |
| 3.1 – OpenDCDiag: A Scalable Open-Source Solution to Search for Silent Data Errors Thiago Macieira (Intel) | 4.1 – RAIDDR: Error Correction for Multi-device Busses Majid Nemati, Terry Grunzke, Brett Dodds and Adam Grenzebach (Microsoft) |
| 3.2 – Writing SDE-finding tests using OpenDCDiag Rohit Agashe and Thiago Macieira (Intel) | 4.2 – CXL RAS learnings Manjunaatha Harapanahalli, Erwin Tsaur and Mahesh Natu (Intel) |
| 3.3 – Maintaining data integrity during transformation Smita Kumar, Patrick Fleming, Gordon McFadden and Sailesh Bissessur (Intel) | 4.3 – Data Centers’ Reliability Risks due to Faults Affecting their High Performance Microprocessors’ Caches Martin Omana, Annalisa Manfredi, Cecilia Metra, Riccardo Locatelli, Monia Chiavacci and Stefano Petrucci (U Bologna, Intel) |
| 3.4 – Microarchitectural Modeling of Modern CPUs for SDCs Prediction in Data Centers Dimitris Gizopoulos, George Papadimitriou and Odysseas Chatzopoulos (U Athens) | 4.4 – The Management Era: Predictive DRAM Fault Analysis with Architecture Awareness Hoiju Chung, Yongjun Lee, Woongju Jang, Euisang Oh, Sanghwan Lee, Paul Fahey, Kijoong Choi, Arhatha Bramhanand and Brett Dodds (SK Hynix and Microsoft) |
| 3.5 – The Challenges of Operating a Heterogeneous Edge Cluster Nicolas Oliver, Rajkumar Patel, Dean Throop and Mrinal Karvir (Intel) | 4.5 – Managing Memory Correctable Error Solutions Shawn Fan, Alex Zhou, Eric Li, Annie Yu, Taniya Siddiqua, Kaushik Balasubramanian, Fang Yuan and Xiaoguo Liang (Intel and Tencent) |
10:30 – 11:00 | Coffee Break |
11:00 – 12:30 | Session 5 AI and RAS | Session 6 Testing and Resilience |
| 5.1 – Meta AI Server Reliability Dimensional Analysis Peng Xiao and Mihir Patel (Meta) | 6.1 – Delay Monitoring Under Different PVT Corners Hari Addepalli, Jiezhong Wu, Nilanjan Mukherjee, Irith Pomeranz and Janusz Rajski (Purdue U and Siemens) |
| 5.2 – Comprehensive Reliability Analysis in AI systems Anju John, Matt Bergeron and Mihir Patel (Meta) | 6.2 – Timing-Verification Test for Timing Related Defects Jiezhong Wu, Hari Addepalli, Nilanjan Mukherjee, Irith Pomeranz, Kun-Han Tsai and Janusz Rajski (Purdue U and Siemens) |
| 5.3 – Build High Reliability/Availability/Serviceability head node for AI server Alex Zhou, Yu Zhang, Chenchen Li, Fang Yuan, Shijian Ge, Albert Hu, Liang Peng, Antonio J Hasbun Marin and Shawn Fan (Intel and ByteDance) | 6.3 – A Functionally-Aware Scan-Based Test Solution for Silent Data Corruption Irith Pomeranz and Yervant Zorian (Purdue U and Synopsys) |
| 5.4 – PVF (Parameter Vulnerability Factor): A Scalable Metric to Quantify AI Vulnerability to Parameter Corruptions Xun Jiao, Fred Lin and Harish Dixit (Meta) | 6.4 – ResGNN: A Generic Framework for Measuring Graph Neural Network Resilience Against Faults and Attacks in Hardware Systems Hanqiu Chen, Zishen Wan and Cong Hao (Georgia Tech) |
| 5.5 – What does measuring resilience in AI systems entail? Chitkala Sethuraman (Microsoft) |
| 5.6 – Dual Transformer Encoding: Remaining Useful Life Estimation through Channel-Independent and Collective Approach Paul Nikolian and Fadi Kurdahi (UC Irvine) |
12:30 – 14:00 | Networking Lunch |
07:00 – 08:00 | Breakfast |
08:00 – 09:15 | Session 1 Data Center RAS 1 Moderator: Bharath Parthasarathy | Session 2 Memory and Interconnects 1 Moderator: Kwabena Boateng |
1.1 – Silent Data Corruption – Intel-Meta joint collaboration to detect and mitigate at-scale Shubhada Sahasrabudhe, Harish Dixit, David Lerner, Tejasvi Chakravarthy, Thiago Macieira, Matt Beadon, Sriram Sankar and Ethan Hansen (Intel and Meta) | 2.1 – AI in BMC: Improving DDR5 Memory Reliability in Hyperscale Data Centers Shen Zhou, Dahai Zhou, Haoyu Ruan, Zhibing Li, Yi Li, Keke Xie, and Yogesh Varma (Intel and Alibaba) | |
1.2 – Silent Data Corruption – Meta-AMD silent error collaboration for screening efficiency at-scale Gautham Vunnam, Abishek Hariharan, Sankarnarayanan Gurumurthy, Tejasvi Chakravarthy, Harish Dattatraya Dixit (Meta and AMD) | 2.2 – The Future is Now: Empowering DRAM ECC through a Forgotten Coding Theory Kelly Fitzpatrick, Saeed Raja, Yang Liu and Tong Zhang (ScaleFlux) | |
1.3 – RAS Significance and Challenges in Hyper-Scalar Data Centers with Need for Industry Standardization Tulika Jha, Bob Krick, John Lee and Saurabh Agrawal (Microsoft) | 2.3 – Standardized RAS API using CXL Component Command Interface Shubhada Pugaonkar and Antonio Hasbun Marin (Intel) | |
1.4 – Innovative Approaches to Solving Flash-Induced Latencies in Hyperscale Environments Vineet Parekh, Suman Gumudaveli and Venkat Ramesh (Meta) | 2.4 – Reducing Memory Errors On-the-fly with Prediction-Guided Failure Prevention Shen Zhou, Yu Zhang, Chenchen Li, Linlin Han and Feng Xu (Intel and ByteDance) | |
1.5 – Open Compute Project’s Server Resilience Specification 1.0 Thiago Macieira (Intel) | 2.5 – PCIe Error Handling Challenges in building AI/ML systems in hyperscale datacenters Anil Agrawal and Bill Holland (Meta) | |
09:15 – 10:30 | Session 3 Data Center RAS 2 Moderator: Harish Dixit | Session 4 Memory and Interconnects 2 Moderator: Sreejit Chakravarty |
3.1 – OpenDCDiag: A Scalable Open-Source Solution to Search for Silent Data Errors Thiago Macieira (Intel) | 4.1 – RAIDDR: Error Correction for Multi-device Busses Majid Nemati, Terry Grunzke, Brett Dodds and Adam Grenzebach (Microsoft) | |
3.2 – Writing SDE-finding tests using OpenDCDiag Rohit Agashe and Thiago Macieira (Intel) | 4.2 – CXL RAS learnings Manjunaatha Harapanahalli, Erwin Tsaur and Mahesh Natu (Intel) | |
3.3 – Maintaining data integrity during transformation Smita Kumar, Patrick Fleming, Gordon McFadden and Sailesh Bissessur (Intel) | 4.3 – Data Centers’ Reliability Risks due to Faults Affecting their High Performance Microprocessors’ Caches Martin Omana, Annalisa Manfredi, Cecilia Metra, Riccardo Locatelli, Monia Chiavacci and Stefano Petrucci (U Bologna, Intel) | |
3.4 – Microarchitectural Modeling of Modern CPUs for SDCs Prediction in Data Centers Dimitris Gizopoulos, George Papadimitriou and Odysseas Chatzopoulos (U Athens) | 4.4 – The Management Era: Predictive DRAM Fault Analysis with Architecture Awareness Hoiju Chung, Yongjun Lee, Woongju Jang, Euisang Oh, Sanghwan Lee, Paul Fahey, Kijoong Choi, Arhatha Bramhanand and Brett Dodds (SK Hynix and Microsoft) | |
3.5 – The Challenges of Operating a Heterogeneous Edge Cluster Nicolas Oliver, Rajkumar Patel, Dean Throop and Mrinal Karvir (Intel) | 4.5 – Managing Memory Correctable Error Solutions Shawn Fan, Alex Zhou, Eric Li, Annie Yu, Zengping Xu, Taniya Siddiqua, Kaushik Balasubramanian, Fang Yuan and Xiaoguo Liang (Intel and Tencent) | |
10:30 – 11:00 | Coffee Break | |
11:00 – 12:30 | Session 5 AI and RAS Moderator: Preeti Chauhan | Session 6 Testing and Resilience Moderator: Chris Connor |
5.1 – Meta AI Server Reliability Dimensional Analysis Peng Xiao and Mihir Patel (Meta) | 6.1 – Delay Monitoring Under Different PVT Corners Hari Addepalli, Jiezhong Wu, Nilanjan Mukherjee, Irith Pomeranz and Janusz Rajski (Purdue U and Siemens) | |
5.2 – Comprehensive Reliability Analysis in AI systems Anju John, Matt Bergeron and Mihir Patel (Meta) | 6.2 – Timing-Verification Test for Timing Related Defects Jiezhong Wu, Hari Addepalli, Nilanjan Mukherjee, Irith Pomeranz, Kun-Han Tsai and Janusz Rajski (Purdue U and Siemens) | |
5.3 – Build High Reliability/Availability/Serviceability head node for AI server Alex Zhou, Yu Zhang, Chenchen Li, Fang Yuan, Shijian Ge, Albert Hu, Liang Peng, Antonio J Hasbun Marin and Shawn Fan (Intel and ByteDance) | 6.3 – A Functionally-Aware Scan-Based Test Solution for Silent Data Corruption Irith Pomeranz (Purdue U) | |
5.4 – PVF (Parameter Vulnerability Factor): A Scalable Metric to Quantify AI Vulnerability to Parameter Corruptions Xun Jiao, Fred Lin and Harish Dixit (Meta) | 6.4 – ResGNN: A Generic Framework for Measuring Graph Neural Network Resilience Against Faults and Attacks in Hardware Systems Hanqiu Chen, Zishen Wan and Cong Hao (Georgia Tech) | |
5.5 – What does measuring resilience in AI systems entail? Chitkala Sethuraman (Microsoft) | ||
5.6 – Multi Channel Transformer: Remaining Useful Life Estimation through Channel-Independent and Collective Approach Paul Nikolian and Fadi Kurdahi (UC Irvine) | ||
12:30 – 14:00 | Networking Lunch |
Registration will be located outside of Sedona
Registration Hours: 7:00 am – 8:30 am
Tuesday, June 11
All sessions will be held in Sedona
Breakfast & Lunch will be located in Salons 7-9
Reception will be held in the Orchard Lounge
Wednesday, June 12 (half day session)
Breakout Session #1 - Salon A
Breakout Session #2 - Salon B
Breakfast and Lunch will be located in Salons 7-9
Dr. Zane A. Ball is a Corporate Vice President and General Manager of the Data Center and AI (DCAI) Product Management Group at Intel. DCAI Product Management is responsible for end-to-end stewardship of DCAI’s systems, software, CPU, GPU, and custom product lines through the entire product lifecycle. Prior to his product management role, Ball was CVP and GM of platform engineering and architecture for Intel’s data center business. Ball has also served as Co-GM of Intel’s foundry effort as a VP in the Technology and Manufacturing group, and as VP of the Client Computing Group, including roles as GM of the desktop client business and GM of global customer engineering.
Ball has a bachelor’s degree, master’s degree, and Ph.D. in electrical engineering, all earned from Rice University. He also holds six patents in high-speed electrical design.