IEEE-RAS-in-Data-Centers-Summit-logo

Tuesday, 11th June 2024 (Single Track)

Timeslot | Topic | Organizer | Moderator | Speaker | Title
7:00 – 8:30 | Registration and Breakfast
8:30 – 9:00 | Opening Address | Jyotika Athavale, Yervant Zorian, Dimitris Gizopoulos, Amr Haggag (Chairs) | Welcome
9:00 – 9:30 | Keynote 1 | Jyotika Athavale (IEEE Computer Society) | Ankur Garg (Microsoft) | Navigating the AI Era: The Essential Role of RAS
9:30 – 10:00 | Keynote 2 | Steve Hesley (AMD) | The Vital Role of RAS
10:00 – 10:30 | Coffee Break
10:30 – 12:00 | Invited Special Session 1 (SS1 Quality) | SDC (ELF and best practices for in-situ screening)
Organizer: Rama Govindaraju (Google)
Moderator: Rama Govindaraju (Google)
Subhasish Mitra (Stanford): A Cambrian Explosion in Robust Computing Systems is Dead Ahead
Vilas Sridharan (AMD): Addressing emerging fault modes with testing and reliability
Harish Dixit (Meta): Silent Data Corruptions at Scale
David Lerner (Intel): Silent Data Errors: Causes, Implications to AI Workloads, and In-Field Mitigation
12:00 – 13:00 | Lunch
13:00 – 13:30 | Keynote 3 | Yervant Zorian (Synopsys) | George Tchaparian (OCP) | Challenges and OCP Community Progress towards Reliability, Availability, and Serviceability (RAS) with a Special Focus on Artificial Intelligence
13:30 – 14:00 | Keynote 4 | Zane Ball (Intel) | Pioneering Reliability, Availability, and Serviceability in the AI era
14:00 – 15:30 | Invited Special Session 2 (SS2 Reliability) | Hardware Fault Management
Organizer: Rama Bhimanadhuni (Microsoft)
Moderator: Shawn Blanton (CMU)
Yogesh Varma (Intel): Towards Autonomous Hardware Fault Management
Rama Bhimanadhuni (MSFT): Enabling Generative AI - Exploring RAS Requirements for Hyperscale AI Infrastructure
Drew Walton: Efforts to Address Fault Management Challenges in OCP
Dimitris Gizopoulos (U of Athens): The Role of Abstraction and Modeling in the Assessment of Hardware Faults Effects
15:30 – 16:00 | Coffee Break
16:00 – 17:30 | Invited Special Session 3 (SS3 Availability) | Pathfinding Toward SoC Self-Healing Architecture
Organizer: Yogesh Varma (Intel)
Moderator: Cecilia Metra (U of Bologna)
Swadesh Choudhary (Intel): UCIe RAS Overview
Yervant Zorian (Synopsys): RAS Challenges & Solution for Today’s Chiplet-based Systems
Arijit Biswas (Intel): Pathfinding Toward SoC Self-Healing Architecture
Sanjay Gongalore (Nvidia): Maximizing Availability for a Zettascale Datacenter
17:30 – 19:00 | Reception with IEEE CS BOG
19:00 – 20:30 | Invited Special Session 4 (SS4 Serviceability) | In-fleet Serviceability
Organizer: Drew Walton
Moderator: Drew Walton
John Holm (Intel): Enhancing Computer Serviceability Through Error Telemetry
Rob Chappell (Microsoft): Challenges in Hyperscale Serviceability
Anil Agrawal (Meta): In-band Error Handling requirements for RAS in hyperscale data centers
Amit Pandey (Amazon): Addressing Serviceability throughout device lifecycle with High Speed Access for Test

Wednesday, 12th June 2024 (Dual Track)

07:00 – 08:00

Breakfast

08:00 – 09:15

Session 1
Data Center RAS 1

Session 2

Memory and Interconnects 1

 

1.1 – Silent Data Corruption – Intel-Meta joint collaboration to detect and mitigate at-scale

Shubhada Sahasrabudhe, Harish Dixit, David Lerner, Tejasvi Chakravarthy, Thiago Macieira, Matt Beadon, Sriram Sankar and Ethan Hansen (Intel and Meta)

2.1 – AI in BMC: Improving DDR5 Memory Reliability in Hyperscale Data Centers

Shen Zhou, Dahai Zhou, Gaoyu Ruan, Zhibing Li, Yi Li and Keke Xie (Intel and Alibaba)           

 

1.2 – Silent Data Corruption – Meta-AMD silent error collaboration for screening efficiency at-scale

Tejasvi Chakravarthy, Sankarnarayanan Gurumurthy, Harish Dattatraya Dixit and Abishek Hariharan (Meta and AMD)

2.2 – The Future is Now: Empowering DRAM ECC through a Forgotten Coding Theory

Kelly Fitzpatrick, Saeed Raja, Yang Liu and Tong Zhang (ScaleFlux)

 

1.3 – RAS Significance and Challenges in Hyper-Scalar Data Centers with Need for Industry Standardization

Tulika Jha, Bob Krick, John Lee and Saurabh Agrawal (Microsoft)

2.3 – Standardized RAS API using CXL Component Command Interface

Shubhada Pugaonkar and Antonio Hasbun Marin (Intel)

 

1.4 – Innovative Approaches to Solving Flash-Induced Latencies in Hyperscale Environments

Vineet Parekh, Suman Gumudaveli and Venkat Ramesh (Meta)

2.4 – Reducing Memory Errors On-the-fly with Prediction-Guided Failure Prevention

Shen Zhou, Yu Zhang, Chenchen Li, Linlin Han and Feng Xu (Intel and ByteDance)

 

1.5 – Open Compute Project’s Server Resilience Specification 1.0

Thiago Macieira (Intel)

2.5 – PCIe Error Handling Challenges in building AI/ML systems in hyperscale datacenters

Anil Agrawal and Bill Holland (Meta)

09:15 – 10:30

Session 3

Data Center RAS 2

Session 4

Memory and Interconnects 2

 

3.1 – OpenDCDiag: A Scalable Open-Source Solution to Search for Silent Data Errors

Thiago Macieira (Intel)

4.1 – RAIDDR: Error Correction for Multi-device Busses

Majid Nemati, Terry Grunzke, Brett Dodds and Adam Grenzebach (Microsoft)

 

3.2 – Writing SDE-finding tests using OpenDCDiag

Rohit Agashe and Thiago Macieira (Intel)

4.2 – CXL RAS learnings

Manjunaatha Harapanahalli, Erwin Tsaur and Mahesh Natu (Intel)

 

3.3 – Maintaining data integrity during transformation

Smita Kumar, Patrick Fleming, Gordon McFadden and Sailesh Bissessur (Intel)

4.3 – Data Centers’ Reliability Risks due to Faults Affecting their High Performance Microprocessors’ Caches

Martin Omana, Annalisa Manfredi, Cecilia Metra, Riccardo Locatelli, Monia Chiavacci and Stefano Petrucci (U Bologna, Intel)

 

3.4 – Microarchitectural Modeling of Modern CPUs for SDCs Prediction in Data Centers

Dimitris Gizopoulos, George Papadimitriou and Odysseas Chatzopoulos (U Athens) 

4.4 – The Management Era: Predictive DRAM Fault Analysis with Architecture Awareness

Hoiju Chung, Yongjun Lee, Woongju Jang, Euisang Oh, Sanghwan Lee, Paul Fahey, Kijoong Choi, Arhatha Bramhanand and Brett Dodds (SK Hynix and Microsoft)

 

3.5 – The Challenges of Operating a Heterogeneous Edge Cluster

Nicolas Oliver, Rajkumar Patel, Dean Throop and Mrinal Karvir (Intel)

4.5 – Managing Memory Correctable Error Solutions

Shawn Fan, Alex Zhou, Eric Li, Annie Yu, Taniya Siddiqua, Kaushik Balasubramanian, Fang Yuan and Xiaoguo Liang (Intel and Tencent)

10:30 – 11:00

Coffee Break

11:00 – 12:30

Session 5

AI and RAS

Session 6

Testing and Resilience

 

5.1 – Meta AI Server Reliability Dimensional Analysis

Peng Xiao and Mihir Patel (Meta)

6.1 – Delay Monitoring Under Different PVT Corners

Hari Addepalli, Jiezhong Wu, Nilanjan Mukherjee, Irith Pomeranz and Janusz Rajski (Purdue U and Siemens)

 

5.2 – Comprehensive Reliability Analysis in AI systems

Anju John, Matt Bergeron and Mihir Patel (Meta)

6.2 – Timing-Verification Test for Timing Related Defects

Jiezhong Wu, Hari Addepalli, Nilanjan Mukherjee, Irith Pomeranz, Kun-Han Tsai and Janusz Rajski (Purdue U and Siemens)

 

5.3 – Build High Reliability/Availability/Serviceability head node for AI server

Alex Zhou, Yu Zhang, Chenchen Li, Fang Yuan, Shijian Ge, Albert Hu, Liang Peng, Antonio J Hasbun Marin and Shawn Fan (Intel and ByteDance)

6.3 – A Functionally-Aware Scan-Based Test Solution for Silent Data Corruption

Irith Pomeranz and Yervant Zorian (Purdue U and Synopsys)

 

5.4 – PVF (Parameter Vulnerability Factor): A Scalable Metric to Quantify AI Vulnerability to Parameter Corruptions

Xun Jiao, Fred Lin and Harish Dixit (Meta)

6.4 – ResGNN: A Generic Framework for Measuring Graph Neural Network Resilience Against Faults and Attacks in Hardware Systems

Hanqiu Chen, Zishen Wan and Cong Hao (Georgia Tech)             

 

5.5 – What does measuring resilience in AI systems entail?

Chitkala Sethuraman (Microsoft)

5.6 – Dual Transformer Encoding: Remaining Useful Life Estimation through Channel-Independent and Collective Approach

Paul Nikolian and Fadi Kurdahi (UC Irvine)

12:30 – 14:00

Networking Lunch

 

Wednesday, 12th June 2024 (Dual Track)

07:00 – 08:00

Breakfast

08:00 – 09:15

Session 1

Data Center RAS 1

Moderator: Bharath Parthasarathy

Session 2

Memory and Interconnects 1

Moderator: Kwabena Boateng

 

1.1 – Silent Data Corruption – Intel-Meta joint collaboration to detect and mitigate at-scale

Shubhada Sahasrabudhe, Harish Dixit, David Lerner, Tejasvi Chakravarthy, Thiago Macieira, Matt Beadon, Sriram Sankar and Ethan Hansen (Intel and Meta)

2.1 – AI in BMC: Improving DDR5 Memory Reliability in Hyperscale Data Centers

Shen Zhou, Dahai Zhou, Haoyu Ruan, Zhibing Li, Yi Li, Keke Xie, and Yogesh Varma (Intel and Alibaba)           

 

1.2 – Silent Data Corruption – Meta-AMD silent error collaboration for screening efficiency at-scale

Gautham Vunnam, Abishek Hariharan, Sankarnarayanan Gurumurthy, Tejasvi Chakravarthy, Harish Dattatraya Dixit (Meta and AMD)

2.2 – The Future is Now: Empowering DRAM ECC through a Forgotten Coding Theory

Kelly Fitzpatrick, Saeed Raja, Yang Liu and Tong Zhang (ScaleFlux)

 

1.3 – RAS Significance and Challenges in Hyper-Scalar Data Centers with Need for Industry Standardization

Tulika Jha, Bob Krick, John Lee and Saurabh Agrawal (Microsoft)

2.3 – Standardized RAS API using CXL Component Command Interface

Shubhada Pugaonkar and Antonio Hasbun Marin (Intel)

 

1.4 – Innovative Approaches to Solving Flash-Induced Latencies in Hyperscale Environments

Vineet Parekh, Suman Gumudaveli and Venkat Ramesh (Meta)

2.4 – Reducing Memory Errors On-the-fly with Prediction-Guided Failure Prevention

Shen Zhou, Yu Zhang, Chenchen Li, Linlin Han and Feng Xu (Intel and ByteDance)

 

1.5 – Open Compute Project’s Server Resilience Specification 1.0

Thiago Macieira (Intel)

2.5 – PCIe Error Handling Challenges in building AI/ML systems in hyperscale datacenters

Anil Agrawal and Bill Holland (Meta)

09:15 – 10:30

Session 3

Data Center RAS 2

Moderator: Harish Dixit

Session 4

Memory and Interconnects 2

Moderator: Sreejit Chakravarty

 

3.1 – OpenDCDiag: A Scalable Open-Source Solution to Search for Silent Data Errors

Thiago Macieira (Intel)

4.1 – RAIDDR: Error Correction for Multi-device Busses

Majid Nemati, Terry Grunzke, Brett Dodds and Adam Grenzebach (Microsoft)

 

3.2 – Writing SDE-finding tests using OpenDCDiag

Rohit Agashe and Thiago Macieira (Intel)

4.2 – CXL RAS learnings

Manjunaatha Harapanahalli, Erwin Tsaur and Mahesh Natu (Intel)

 

3.3 – Maintaining data integrity during transformation

Smita Kumar, Patrick Fleming, Gordon McFadden and Sailesh Bissessur (Intel)

4.3 – Data Centers’ Reliability Risks due to Faults Affecting their High Performance Microprocessors’ Caches

Martin Omana, Annalisa Manfredi, Cecilia Metra, Riccardo Locatelli, Monia Chiavacci and Stefano Petrucci (U Bologna, Intel)

 

3.4 – Microarchitectural Modeling of Modern CPUs for SDCs Prediction in Data Centers

Dimitris Gizopoulos, George Papadimitriou and Odysseas Chatzopoulos (U Athens) 

4.4 – The Management Era: Predictive DRAM Fault Analysis with Architecture Awareness

Hoiju Chung, Yongjun Lee, Woongju Jang, Euisang Oh, Sanghwan Lee, Paul Fahey, Kijoong Choi, Arhatha Bramhanand and Brett Dodds (SK Hynix and Microsoft)

 

3.5 – The Challenges of Operating a Heterogeneous Edge Cluster

Nicolas Oliver, Rajkumar Patel, Dean Throop and Mrinal Karvir (Intel)

4.5 – Managing Memory Correctable Error Solutions

Shawn Fan, Alex Zhou, Eric Li, Annie Yu, Zengping Xu, Taniya Siddiqua, Kaushik Balasubramanian, Fang Yuan and Xiaoguo Liang (Intel and Tencent)

10:30 – 11:00

Coffee Break

11:00 – 12:30

Session 5

AI and RAS

Moderator: Preeti Chauhan

Session 6

Testing and Resilience

Moderator: Chris Connor

 

5.1 – Meta AI Server Reliability Dimensional Analysis

Peng Xiao and Mihir Patel (Meta)

6.1 – Delay Monitoring Under Different PVT Corners

Hari Addepalli, Jiezhong Wu, Nilanjan Mukherjee, Irith Pomeranz and Janusz Rajski (Purdue U and Siemens)

 

5.2 – Comprehensive Reliability Analysis in AI systems

Anju John, Matt Bergeron and Mihir Patel (Meta)

6.2 – Timing-Verification Test for Timing Related Defects

Jiezhong Wu, Hari Addepalli, Nilanjan Mukherjee, Irith Pomeranz, Kun-Han Tsai and Janusz Rajski (Purdue U and Siemens)

 

5.3 – Build High Reliability/Availability/Serviceability head node for AI server

Alex Zhou, Yu Zhang, Chenchen Li, Fang Yuan, Shijian Ge, Albert Hu, Liang Peng, Antonio J Hasbun Marin and Shawn Fan (Intel and ByteDance)

6.3 – A Functionally-Aware Scan-Based Test Solution for Silent Data Corruption

Irith Pomeranz (Purdue U)

 

5.4 – PVF (Parameter Vulnerability Factor): A Scalable Metric to Quantify AI Vulnerability to Parameter Corruptions

Xun Jiao, Fred Lin and Harish Dixit (Meta)

6.4 – ResGNN: A Generic Framework for Measuring Graph Neural Network Resilience Against Faults and Attacks in Hardware Systems

Hanqiu Chen, Zishen Wan and Cong Hao (Georgia Tech)             

 

5.5 – What does measuring resilience in AI systems entail?

Chitkala Sethuraman (Microsoft)

5.6 – Multi Channel Transformer: Remaining Useful Life Estimation through Channel-Independent and Collective Approach

Paul Nikolian and Fadi Kurdahi (UC Irvine)

12:30 – 14:00

Networking Lunch

 

Registration will be located outside of Sedona

Registration Hours: 7:00 am – 8:30 am 

Tuesday, June 11

All sessions will be held in Sedona

Breakfast & Lunch will be located in Salons 7-9

Reception will be held in the Orchard Lounge 

Wednesday, June 12 (half day session)

Breakout Session #1- Salon A

Breakout Session #2- Salon B 

Breakfast and Lunch will be located in Salons 7-9

Keynote

Corporate Vice President, General Manager, Data Center and AI Product Management, Intel Corporation

Dr. Zane A. Ball is a Corporate Vice President and General Manager of the Data Center and AI (DCAI) Product Management Group. DCAI Product Management is responsible for end-to-end stewardship of DCAI’s systems, software, CPU, GPU, and custom product lines throughout the product lifecycle. Prior to his product management role, Ball was CVP and GM of platform engineering and architecture for Intel’s data center business. He has also served as Co-GM of Intel’s foundry effort as a VP in the Technology and Manufacturing Group, and as a VP in the Client Computing Group, where his roles included GM of the desktop client business and GM of global customer engineering.

Ball holds bachelor’s, master’s, and Ph.D. degrees in electrical engineering, all from Rice University. He also holds six patents in high-speed electrical design.