IEEE-RAS-in-Data-Centers-Summit-logo

Tuesday, 11th June 2024 (Single Track)

Timeslot

Topic

Organizer

Moderator

Speaker

Title

7:00 – 8:30

Registration and Breakfast

8:30 – 9:00

Opening Address

Jyotika Athavale, Yervant Zorian, Dimitris Gizopoulos, Amr Haggag (Chairs)

Welcome

9:00 – 10:00

9:00-9:30

Keynote 1

Rama Bhimanadhuni (Microsoft)

Jyotika Athavale (IEEE Computer Society)

Ankur Garg (Microsoft)

9:30-10:00

Keynote 2

Sankar Gurumurthy (AMD)

Steve Hesley (AMD)

10:00 – 10:30

Coffee Break

10:30 – 12:00

Invited Special Session 1 (SS1 Quality)

Rama Govindaraju (Google)

Rama Govindaraju (Google)

Subhasish Mitra (Stanford), Vilas Sridharan (AMD), Harish/Sriram (Meta), David L/Thiago (Intel)

SDC (ELF and best practices for in-situ screening)

12:00 – 13:00

Lunch

13:00 – 14:00

13:00-13:30

Keynote 3

Yervant Zorian (Synopsys)

Yervant Zorian (Synopsys)

George Tchaparian (OCP)

13:30-14:00

Keynote 4

Chris Connor (Intel)

Zane Ball (Intel)

14:00 – 15:30

Invited Special Session 2 (SS2 Reliability)

Rama Bhimanadhuni (Microsoft)

Shawn Blanton (CMU)

Yogesh Varma (Intel), Rama Bhimanadhuni (Microsoft), Drew Walton (Google), Dimitris Gizopoulos (U of Athens)

Hardware Fault Management

15:30 – 16:00

Coffee Break

16:00 – 17:30

Invited Special Session 3 (SS3 Availability)

Yogesh Varma (Intel)

Cecilia Metra (U of Bologna)

Debendra Das Sharma (Intel), Yervant Zorian (Synopsys), Arijit Biswas (Intel), Sanjay Gongalore (Nvidia)

Pathfinding Toward SoC Self-Healing Architecture

17:30 – 19:00

Reception with IEEE CS BOG

19:00 – 20:30

Invited Special Session 4 (SS4 Serviceability)

Drew Walton (Google)

Drew Walton (Google)

John Holm (Intel), Rob Chapman (Microsoft), Anil Agrawal (Meta), Amit Pandey (Amazon)

In-fleet Serviceability 

Wednesday, 12th June 2024 (Dual Track)

07:00 – 08:00

Breakfast

08:00 – 09:15

Session 1

Data Center RAS 1

Session 2

Memory and Interconnects 1

 

1.1 – Silent Data Corruption – Intel-Meta joint collaboration to detect and mitigate at-scale

Shubhada Sahasrabudhe, Harish Dixit, David Lerner, Tejasvi Chakravarthy, Thiago Macieira, Matt Beadon, Sriram Sankar and Ethan Hansen (Intel and Meta)

2.1 – AI in BMC: Improving DDR5 Memory Reliability in Hyperscale Data Centers

Shen Zhou, Dahai Zhou, Gaoyu Ruan, Zhibing Li, Yi Li and Keke Xie (Intel and Alibaba)           

 

1.2 – Silent Data Corruption – Meta-AMD silent error collaboration for screening efficiency at-scale

Tejasvi Chakravarthy, Sankarnarayanan Gurumurthy, Harish Dattatreya Dixit and Abishek Hariharan (Meta and AMD)

2.2 – The Future is Now: Empowering DRAM ECC through a Forgotten Coding Theory

Kelly Fitzpatrick, Saeed Raja, Yang Liu and Tong Zhang (ScaleFlux)

 

1.3 – RAS Significance and Challenges in Hyper-Scalar Data Centers with Need for Industry Standardization

Tulika Jha, Bob Krick, John Lee and Saurabh Agrawal (Microsoft)

2.3 – Standardized RAS API using CXL Component Command Interface

Shubhada Pugaonkar and Antonio Hasbun Marin (Intel)

 

1.4 – Innovative Approaches to Solving Flash-Induced Latencies in Hyperscale Environments

Vineet Parekh, Suman Gumudaveli and Venkat Ramesh (Meta)

2.4 – Reducing Memory Errors On-the-fly with Prediction-Guided Failure Prevention

Shen Zhou, Yu Zhang, Chenchen Li, Linlin Han and Feng Xu (Intel and ByteDance)

 

1.5 – Open Compute Project’s Server Resilience Specification 1.0

Thiago Macieira (Intel)

2.5 – PCIe Error Handling Challenges in building AI/ML systems in hyperscale datacenters

Anil Agrawal and Bill Holland (Meta)

09:15 – 10:30

Session 3

Data Center RAS 2

Session 4

Memory and Interconnects 2

 

3.1 – OpenDCDiag: A Scalable Open-Source Solution to Search for Silent Data Errors

Thiago Macieira (Intel)

4.1 – RAIDDR: Error Correction for Multi-device Busses

Majid Nemati, Terry Grunzke, Brett Dodds and Adam Grenzebach (Microsoft)

 

3.2 – Writing SDE-finding tests using OpenDCDiag

Rohit Agashe and Thiago Macieira (Intel)

4.2 – CXL RAS learnings

Manjunaatha Harapanahalli, Erwin Tsaur and Mahesh Natu (Intel)

 

3.3 – Maintaining data integrity during transformation

Smita Kumar, Patrick Fleming, Gordon McFadden and Sailesh Bissessur (Intel)

4.3 – Data Centers’ Reliability Risks due to Faults Affecting their High Performance Microprocessors’ Caches

Martin Omana, Annalisa Manfredi, Cecilia Metra, Riccardo Locatelli, Monia Chiavacci and Stefano Petrucci (U Bologna, Intel)

 

3.4 – Microarchitectural Modeling of Modern CPUs for SDCs Prediction in Data Centers

Dimitris Gizopoulos, George Papadimitriou and Odysseas Chatzopoulos (U Athens) 

4.4 – The Management Era: Predictive DRAM Fault Analysis with Architecture Awareness

Hoiju Chung, Yongjun Lee, Woongju Jang, Euisang Oh, Sanghwan Lee, Paul Fahey, Kijoong Choi, Arhatha Bramhanand and Brett Dodds (SK Hynix and Microsoft)

 

3.5 – The Challenges of Operating a Heterogeneous Edge Cluster

Nicolas Oliver, Rajkumar Patel, Dean Throop and Mrinal Karvir (Intel)

4.5 – Managing Memory Correctable Error Solutions

Shawn Fan, Alex Zhou, Eric Li, Annie Yu, Taniya Siddiqua, Kaushik Balasubramanian, Fang Yuan and Xiaoguo Liang (Intel and Tencent)

10:30 – 11:00

Coffee Break

11:00 – 12:30

Session 5

AI and RAS

Session 6

Testing and Resilience

 

5.1 – Meta AI Server Reliability Dimensional Analysis

Peng Xiao and Mihir Patel (Meta)

6.1 – Delay Monitoring Under Different PVT Corners

Hari Addepalli, Jiezhong Wu, Nilanjan Mukherjee, Irith Pomeranz and Janusz Rajski (Purdue U and Siemens)

 

5.2 – Comprehensive Reliability Analysis in AI systems

Anju John, Matt Bergeron and Mihir Patel (Meta)

6.2 – Timing-Verification Test for Timing Related Defects

Jiezhong Wu, Hari Addepalli, Nilanjan Mukherjee, Irith Pomeranz, Kun-Han Tsai and Janusz Rajski (Purdue U and Siemens)

 

5.3 – Build High Reliability/Availability/Serviceability head node for AI server

Alex Zhou, Yu Zhang, Chenchen Li, Fang Yuan, Shijian Ge, Albert Hu, Liang Peng, Antonio J Hasbun Marin and Shawn Fan (Intel and ByteDance)

6.3 – A Functionally-Aware Scan-Based Test Solution for Silent Data Corruption

Irith Pomeranz and Yervant Zorian (Purdue U and Synopsys)

 

5.4 – PVF (Parameter Vulnerability Factor): A Scalable Metric to Quantify AI Vulnerability to Parameter Corruptions

Xun Jiao, Fred Lin and Harish Dixit (Meta)

6.4 – ResGNN: A Generic Framework for Measuring Graph Neural Network Resilience Against Faults and Attacks in Hardware Systems

Hanqiu Chen, Zishen Wan and Cong Hao (Georgia Tech)             

 

5.5 – What does measuring resilience in AI systems entail?

Chitkala Sethuraman (Microsoft)

5.6 – Dual Transformer Encoding: Remaining Useful Life Estimation through Channel-Independent and Collective Approach

Paul Nikolian and Fadi Kurdahi (UC Irvine)

12:30 – 14:00

Networking Lunch

Wednesday, 12th June 2024 (Dual Track)

Timeslot Topic Moderator
7:00 – 8:00 Registration and Breakfast
8:00 – 9:30 2 Parallel Tracks of Technical Presentations Dimitris Gizopoulos
9:30 – 11:00 2 Parallel Tracks of Technical Presentations Amr Haggag
11:00 – 11:30 Coffee Break
11:30 – 13:00 2 Parallel Tracks of Technical Presentations Kwabena Boateng
13:00 – 14:00 Networking Lunch