Program Structure – IEEE RAS 2024

Tuesday, 11th June 2024 (Single Track)

Timeslot		Topic	Organizer	Moderator	Speaker	Title
7:00 – 8:30		Registration and Breakfast
8:30 – 9:00		Opening Address	Jyotika Athavale, Yervant Zorian, Dimitris Gizopoulos, Amr Haggag (Chairs)			Welcome
9:00 – 10:00	9:00-9:30	Keynote 1	Jyotika Athavale (IEEE Computer Society)		Ankur Garg (Microsoft)	Navigating the AI Era: The Essential Role of RAS
	9:30-10:00	Keynote 2			Steve Hesley (AMD)	The Vital Role of RAS
10:00 – 10:30		Coffee Break
10:30 – 12:00		Invited Special Session 1 (SS1 Quality)	Rama Govindaraju (Google)	Rama Govindaraju (Google)	Subhasish Mitra (Stanford); Vilas Sridharan (AMD); Harish Dixit (Meta); David Lerner (Intel)	SDC (ELF and best practices for in-situ screening). Talks: A Cambrian Explosion in Robust Computing Systems is Dead Ahead; Addressing emerging fault modes with testing and reliability; Silent Data Corruptions at Scale; Silent Data Errors: Causes, Implications to AI Workloads, and In-Field Mitigation
12:00 – 13:00		Lunch
13:00 – 14:00	13:00-13:30	Keynote 3	Yervant Zorian (Synopsys)		George Tchaparian (OCP)	Challenges and OCP Community Progress towards Reliability, Availability, and Serviceability (RAS) with a Special Focus on Artificial Intelligence
	13:30-14:00	Keynote 4			Zane Ball (Intel)	Pioneering Reliability, Availability, and Serviceability in the AI era
14:00 – 15:30		Invited Special Session 2 (SS2 Reliability)	Rama Bhimanadhuni (Microsoft)	Shawn Blanton (CMU)	Yogesh Varma (Intel); Rama Bhimanadhuni (MSFT); Drew Walton; Dimitris Gizopoulos (U of Athens)	Hardware Fault Management. Talks: Towards Autonomous Hardware Fault Management; Enabling Generative AI: Exploring RAS Requirements for Hyperscale AI Infrastructure; Efforts to Address Fault Management Challenges in OCP; The Role of Abstraction and Modeling in the Assessment of Hardware Faults Effects
15:30 – 16:00		Coffee Break
16:00 – 17:30		Invited Special Session 3 (SS3 Availability)	Yogesh Varma (Intel)	Cecilia Metra (U of Bologna)	Swadesh Choudhary (Intel); Yervant Zorian (Synopsys); Arijit Biswas (Intel); Sanjay Gongalore (Nvidia)	Pathfinding Toward SoC Self-Healing Architecture. Talks: UCIe RAS Overview; RAS Challenges & Solution for Today’s Chiplet-based Systems; Pathfinding Toward SoC Self-Healing Architecture; Maximizing Availability for a Zettascale Datacenter
17:30 – 19:00		Reception with IEEE CS BOG
19:00 – 20:30		Invited Special Session 4 (SS4 Serviceability)	Drew Walton	Drew Walton	John Holm (Intel); Rob Chappell (Microsoft); Anil Agrawal (Meta); Amit Pandey (Amazon)	In-fleet Serviceability. Talks: Enhancing Computer Serviceability Through Error Telemetry; Challenges in Hyperscale Serviceability; In-band Error Handling requirements for RAS in hyperscale data centers; Addressing Serviceability throughout device lifecycle with High Speed Access for Test

Wednesday, 12th June 2024 (Dual Track)

07:00 – 08:00	Breakfast

08:00 – 09:15	Session 1 Data Center RAS 1	Session 2 Memory and Interconnects 1
	1.1 – Silent Data Corruption – Intel-Meta joint collaboration to detect and mitigate at-scale Shubhada Sahasrabudhe, Harish Dixit, David Lerner, Tejasvi Chakravarthy, Thiago Macieira, Matt Beadon, Sriram Sankar and Ethan Hansen (Intel and Meta)	2.1 – AI in BMC: Improving DDR5 Memory Reliability in Hyperscale Data Centers Shen Zhou, Dahai Zhou, Gaoyu Ruan, Zhibing Li, Yi Li and Keke Xie (Intel and Alibaba)
	1.2 – Silent Data Corruption – Meta-AMD silent error collaboration for screening efficiency at-scale Tejasvi Chakravarthy, Sankarnarayanan Gurumurthy, Harish Dattatraya Dixit and Abishek Hariharan (Meta and AMD)	2.2 – The Future is Now: Empowering DRAM ECC through a Forgotten Coding Theory Kelly Fitzpatrick, Saeed Raja, Yang Liu and Tong Zhang (ScaleFlux)
	1.3 – RAS Significance and Challenges in Hyper-Scalar Data Centers with Need for Industry Standardization Tulika Jha, Bob Krick, John Lee and Saurabh Agrawal (Microsoft)	2.3 – Standardized RAS API using CXL Component Command Interface Shubhada Pugaonkar and Antonio Hasbun Marin (Intel)
	1.4 – Innovative Approaches to Solving Flash-Induced Latencies in Hyperscale Environments Vineet Parekh, Suman Gumudaveli and Venkat Ramesh (Meta)	2.4 – Reducing Memory Errors On-the-fly with Prediction-Guided Failure Prevention Shen Zhou, Yu Zhang, Chenchen Li, Linlin Han and Feng Xu (Intel and ByteDance)
	1.5 – Open Compute Project’s Server Resilience Specification 1.0 Thiago Macieira (Intel)	2.5 – PCIe Error Handling Challenges in building AI/ML systems in hyperscale datacenters Anil Agrawal and Bill Holland (Meta)

09:15 – 10:30	Session 3 Data Center RAS 2	Session 4 Memory and Interconnects 2
	3.1 – OpenDCDiag: A Scalable Open-Source Solution to Search for Silent Data Errors Thiago Macieira (Intel)	4.1 – RAIDDR: Error Correction for Multi-device Busses Majid Nemati, Terry Grunzke, Brett Dodds and Adam Grenzebach (Microsoft)
	3.2 – Writing SDE-finding tests using OpenDCDiag Rohit Agashe and Thiago Macieira (Intel)	4.2 – CXL RAS learnings Manjunaatha Harapanahalli, Erwin Tsaur and Mahesh Natu (Intel)
	3.3 – Maintaining data integrity during transformation Smita Kumar, Patrick Fleming, Gordon McFadden and Sailesh Bissessur (Intel)	4.3 – Data Centers’ Reliability Risks due to Faults Affecting their High Performance Microprocessors’ Caches Martin Omana, Annalisa Manfredi, Cecilia Metra, Riccardo Locatelli, Monia Chiavacci and Stefano Petrucci (U Bologna, Intel)
	3.4 – Microarchitectural Modeling of Modern CPUs for SDCs Prediction in Data Centers Dimitris Gizopoulos, George Papadimitriou and Odysseas Chatzopoulos (U Athens)	4.4 – The Management Era: Predictive DRAM Fault Analysis with Architecture Awareness Hoiju Chung, Yongjun Lee, Woongju Jang, Euisang Oh, Sanghwan Lee, Paul Fahey, Kijoong Choi, Arhatha Bramhanand and Brett Dodds (SK Hynix and Microsoft)
	3.5 – The Challenges of Operating a Heterogeneous Edge Cluster Nicolas Oliver, Rajkumar Patel, Dean Throop and Mrinal Karvir (Intel)	4.5 – Managing Memory Correctable Error Solutions Shawn Fan, Alex Zhou, Eric Li, Annie Yu, Taniya Siddiqua, Kaushik Balasubramanian, Fang Yuan and Xiaoguo Liang (Intel and Tencent)

10:30 – 11:00	Coffee Break

11:00 – 12:30	Session 5 AI and RAS	Session 6 Testing and Resilience
	5.1 – Meta AI Server Reliability Dimensional Analysis Peng Xiao and Mihir Patel (Meta)	6.1 – Delay Monitoring Under Different PVT Corners Hari Addepalli, Jiezhong Wu, Nilanjan Mukherjee, Irith Pomeranz and Janusz Rajski (Purdue U and Siemens)
	5.2 – Comprehensive Reliability Analysis in AI systems Anju John, Matt Bergeron and Mihir Patel (Meta)	6.2 – Timing-Verification Test for Timing Related Defects Jiezhong Wu, Hari Addepalli, Nilanjan Mukherjee, Irith Pomeranz, Kun-Han Tsai and Janusz Rajski (Purdue U and Siemens)
	5.3 – Build High Reliability/Availability/Serviceability head node for AI server Alex Zhou, Yu Zhang, Chenchen Li, Fang Yuan, Shijian Ge, Albert Hu, Liang Peng, Antonio J Hasbun Marin and Shawn Fan (Intel and ByteDance)	6.3 – A Functionally-Aware Scan-Based Test Solution for Silent Data Corruption Irith Pomeranz and Yervant Zorian (Purdue U and Synopsys)
	5.4 – PVF (Parameter Vulnerability Factor): A Scalable Metric to Quantify AI Vulnerability to Parameter Corruptions Xun Jiao, Fred Lin and Harish Dixit (Meta)	6.4 – ResGNN: A Generic Framework for Measuring Graph Neural Network Resilience Against Faults and Attacks in Hardware Systems Hanqiu Chen, Zishen Wan and Cong Hao (Georgia Tech)
	5.5 – What does measuring resilience in AI systems entail? Chitkala Sethuraman (Microsoft)
	5.6 – Dual Transformer Encoding: Remaining Useful Life Estimation through Channel-Independent and Collective Approach Paul Nikolian and Fadi Kurdahi (UC Irvine)

12:30 – 14:00	Networking Lunch

Wednesday, 12th June 2024 (Dual Track)

07:00 – 08:00	Breakfast
	(Presenting speakers are marked with asterisks.)
08:00 – 09:15	Session 1 Data Center RAS 1 Moderator: Bharath Parthasarathy	Session 2 Memory and Interconnects 1 Moderator: Kwabena Boateng
	1.1 – Silent Data Corruption – Intel-Meta joint collaboration to detect and mitigate at-scale Shubhada Sahasrabudhe, Harish Dixit, *David Lerner*, Tejasvi Chakravarthy, Thiago Macieira, Matt Beadon, Sriram Sankar and Ethan Hansen (Intel and Meta)	2.1 – AI in BMC: Improving DDR5 Memory Reliability in Hyperscale Data Centers Shen Zhou, Dahai Zhou, Haoyu Ruan, Zhibing Li, Yi Li, Keke Xie, and *Yogesh Varma* (Intel and Alibaba)
	1.2 – Silent Data Corruption – Meta-AMD silent error collaboration for screening efficiency at-scale Gautham Vunnam, *Abishek Hariharan*, Sankarnarayanan Gurumurthy, Tejasvi Chakravarthy, Harish Dattatraya Dixit (Meta and AMD)	2.2 – The Future is Now: Empowering DRAM ECC through a Forgotten Coding Theory Kelly Fitzpatrick, Saeed Raja, Yang Liu and Tong Zhang (ScaleFlux)
	1.3 – RAS Significance and Challenges in Hyper-Scalar Data Centers with Need for Industry Standardization *Tulika Jha*, Bob Krick, John Lee and Saurabh Agrawal (Microsoft)	2.3 – Standardized RAS API using CXL Component Command Interface *Shubhada Pugaonkar* and Antonio Hasbun Marin (Intel)
	1.4 – Innovative Approaches to Solving Flash-Induced Latencies in Hyperscale Environments *Vineet Parekh*, Suman Gumudaveli and Venkat Ramesh (Meta)	2.4 – Reducing Memory Errors On-the-fly with Prediction-Guided Failure Prevention Shen Zhou, Yu Zhang, *Chenchen Li*, Linlin Han and Feng Xu (Intel and ByteDance)
	1.5 – Open Compute Project’s Server Resilience Specification 1.0 *Thiago Macieira* (Intel)	2.5 – PCIe Error Handling Challenges in building AI/ML systems in hyperscale datacenters Anil Agrawal and *Bill Holland* (Meta)

09:15 – 10:30	Session 3 Data Center RAS 2 Moderator: Harish Dixit	Session 4 Memory and Interconnects 2 Moderator: Sreejit Chakravarty
	3.1 – OpenDCDiag: A Scalable Open-Source Solution to Search for Silent Data Errors *Thiago Macieira* (Intel)	4.1 – RAIDDR: Error Correction for Multi-device Busses *Majid Nemati*, Terry Grunzke, Brett Dodds and Adam Grenzebach (Microsoft)
	3.2 – Writing SDE-finding tests using OpenDCDiag *Rohit Agashe* and Thiago Macieira (Intel)	4.2 – CXL RAS learnings *Manjunaatha Harapanahalli*, Erwin Tsaur and Mahesh Natu (Intel)
	3.3 – Maintaining data integrity during transformation Smita Kumar, Patrick Fleming, Gordon McFadden and *Sailesh Bissessur* (Intel)	4.3 – Data Centers’ Reliability Risks due to Faults Affecting their High Performance Microprocessors’ Caches Martin Omana, Annalisa Manfredi, *Cecilia Metra*, Riccardo Locatelli, Monia Chiavacci and Stefano Petrucci (U Bologna, Intel)
	3.4 – Microarchitectural Modeling of Modern CPUs for SDCs Prediction in Data Centers *Dimitris Gizopoulos*, George Papadimitriou and Odysseas Chatzopoulos (U Athens)	4.4 – The Management Era: Predictive DRAM Fault Analysis with Architecture Awareness *Hoiju Chung*, Yongjun Lee, Woongju Jang, Euisang Oh, Sanghwan Lee, Paul Fahey, Kijoong Choi, Arhatha Bramhanand and Brett Dodds (SK Hynix and Microsoft)
	3.5 – The Challenges of Operating a Heterogeneous Edge Cluster *Nicolas Oliver*, Rajkumar Patel, Dean Throop and Mrinal Karvir (Intel)	4.5 – Managing Memory Correctable Error Solutions Shawn Fan, Alex Zhou, Eric Li, Annie Yu, *Zengping Xu*, Taniya Siddiqua, Kaushik Balasubramanian, Fang Yuan and Xiaoguo Liang (Intel and Tencent)

10:30 – 11:00	Coffee Break

11:00 – 12:30	Session 5 AI and RAS Moderator: Preeti Chauhan	Session 6 Testing and Resilience Moderator: Chris Connor
	5.1 – Meta AI Server Reliability Dimensional Analysis *Peng Xiao* and Mihir Patel (Meta)	6.1 – Delay Monitoring Under Different PVT Corners Hari Addepalli, Jiezhong Wu, *Nilanjan Mukherjee*, Irith Pomeranz and Janusz Rajski (Purdue U and Siemens)
	5.2 – Comprehensive Reliability Analysis in AI systems *Anju John*, Matt Bergeron and Mihir Patel (Meta)	6.2 – Timing-Verification Test for Timing Related Defects Jiezhong Wu, Hari Addepalli, Nilanjan Mukherjee, Irith Pomeranz, Kun-Han Tsai and *Janusz Rajski* (Purdue U and Siemens)
	5.3 – Build High Reliability/Availability/Serviceability head node for AI server Alex Zhou, Yu Zhang, Chenchen Li, Fang Yuan, *Shijian Ge*, Albert Hu, Liang Peng, Antonio J Hasbun Marin and Shawn Fan (Intel and ByteDance)	6.3 – A Functionally-Aware Scan-Based Test Solution for Silent Data Corruption *Irith Pomeranz* (Purdue U)
	5.4 – PVF (Parameter Vulnerability Factor): A Scalable Metric to Quantify AI Vulnerability to Parameter Corruptions *Xun Jiao*, Fred Lin and Harish Dixit (Meta)	6.4 – ResGNN: A Generic Framework for Measuring Graph Neural Network Resilience Against Faults and Attacks in Hardware Systems *Hanqiu Chen*, Zishen Wan and Cong Hao (Georgia Tech)
	5.5 – What does measuring resilience in AI systems entail? *Chitkala Sethuraman* (Microsoft)
	5.6 – Multi Channel Transformer: Remaining Useful Life Estimation through Channel-Independent and Collective Approach *Paul Nikolian* and Fadi Kurdahi (UC Irvine)

12:30 – 14:00	Networking Lunch

Registration will be located outside of Sedona

Registration Hours: 7:00 am – 8:30 am

Tuesday, June 11

All sessions will be held in Sedona

Breakfast & Lunch will be located in Salons 7-9

Reception will be held in the Orchard Lounge

Wednesday, June 12 (half day session)

Breakout Session #1- Salon A

Breakout Session #2- Salon B

Breakfast and Lunch will be located in Salons 7-9
