Timeslot | Topic | Organizer | Moderator | Speaker | Title |
---|---|---|---|---|---|
7:00 – 8:30 | Registration and Breakfast | | | | |
8:30 – 9:00 | Opening Address | Jyotika Athavale, Yervant Zorian, Dimitris Gizopoulos, Amr Haggag (Chairs) | | | Welcome |
9:00 – 9:30 | Keynote 1 | Rama Bhimanadhuni (Microsoft) | Jyotika Athavale (IEEE Computer Society) | Ankur Garg (Microsoft) | |
9:30 – 10:00 | Keynote 2 | Sankar Gurumurthy (AMD) | Steve Hesley (AMD) | | |
10:00 – 10:30 | Coffee Break | | | | |
10:30 – 12:00 | Invited Special Session 1 (SS1 Quality) | Rama Govindaraju (Google) | Rama Govindaraju (Google) | Subhasish Mitra (Stanford), Vilas Sridharan (AMD), Harish/Sriram (Meta), David L/Thiago (Intel) | SDC (ELF and best practices for in-situ screening) |
12:00 – 13:00 | Lunch | | | | |
13:00 – 13:30 | Keynote 3 | Yervant Zorian (Synopsys) | Yervant Zorian (Synopsys) | George Tchaparian (OCP) | |
13:30 – 14:00 | Keynote 4 | Chris Connor (Intel) | Zane Ball (Intel) | | |
14:00 – 15:30 | Invited Special Session 2 (SS2 Reliability) | Rama Bhimanadhuni (Microsoft) | Shawn Blanton (CMU) | Yogesh Varma (Intel), Rama Bhimanadhuni (MSFT), Drew Walton (Google), Dimitris Gizopoulos (U of Athens) | Hardware Fault Management |
15:30 – 16:00 | Coffee Break | | | | |
16:00 – 17:30 | Invited Special Session 3 (SS3 Availability) | Yogesh Varma (Intel) | Cecilia Metra (U of Bologna) | Debendra Das Sharma (Intel), Yervant Zorian (Synopsys), Arijit Biswas (Intel), Sanjay Gongalore (Nvidia) | Pathfinding Toward SoC Self-Healing Architecture |
17:30 – 19:00 | Reception with IEEE CS BOG | | | | |
19:00 – 20:30 | Invited Special Session 4 (SS4 Serviceability) | Drew Walton (Google) | Drew Walton (Google) | John Holm (Intel), Rob Chapman (Microsoft), Anil Agrawal (Meta), Amit Pandey (Amazon) | In-fleet Serviceability |


| Timeslot | Track 1 | Track 2 |
|---|---|---|
| 07:00 – 08:00 | Breakfast | |
| 08:00 – 09:15 | Session 1 | Session 2 Memory and Interconnects 1 |
| | 1.1 – Silent Data Corruption – Intel-Meta joint collaboration to detect and mitigate at-scale. Shubhada Sahasrabudhe, Harish Dixit, David Lerner, Tejasvi Chakravarthy, Thiago Macieira, Matt Beadon, Sriram Sankar and Ethan Hansen (Intel and Meta) | 2.1 – AI in BMC: Improving DDR5 Memory Reliability in Hyperscale Data Centers. Shen Zhou, Dahai Zhou, Gaoyu Ruan, Zhibing Li, Yi Li and Keke Xie (Intel and Alibaba) |
| | 1.2 – Silent Data Corruption – Meta-AMD silent error collaboration for screening efficiency at-scale. Tejasvi Chakravarthy, Sankarnarayanan Gurumurthy, Harish Dattatreya Dixit and Abishek Hariharan (Meta and AMD) | 2.2 – The Future is Now: Empowering DRAM ECC through a Forgotten Coding Theory. Kelly Fitzpatrick, Saeed Raja, Yang Liu and Tong Zhang (ScaleFlux) |
| | 1.3 – RAS Significance and Challenges in Hyper-Scalar Data Centers with Need for Industry Standardization. Tulika Jha, Bob Krick, John Lee and Saurabh Agrawal (Microsoft) | 2.3 – Standardized RAS API using CXL Component Command Interface. Shubhada Pugaonkar and Antonio Hasbun Marin (Intel) |
| | 1.4 – Innovative Approaches to Solving Flash-Induced Latencies in Hyperscale Environments. Vineet Parekh, Suman Gumudaveli and Venkat Ramesh (Meta) | 2.4 – Reducing Memory Errors On-the-fly with Prediction-Guided Failure Prevention. Shen Zhou, Yu Zhang, Chenchen Li, Linlin Han and Feng Xu (Intel and ByteDance) |
| | 1.5 – Open Compute Project’s Server Resilience Specification 1.0. Thiago Macieira (Intel) | 2.5 – PCIe Error Handling Challenges in building AI/ML systems in hyperscale datacenters. Anil Agrawal and Bill Holland (Meta) |
| 09:15 – 10:30 | Session 3 Data Center RAS 2 | Session 4 Memory and Interconnects 2 |
| | 3.1 – OpenDCDiag: A Scalable Open-Source Solution to Search for Silent Data Errors. Thiago Macieira (Intel) | 4.1 – RAIDDR: Error Correction for Multi-device Busses. Majid Nemati, Terry Grunzke, Brett Dodds and Adam Grenzebach (Microsoft) |
| | 3.2 – Writing SDE-finding tests using OpenDCDiag. Rohit Agashe and Thiago Macieira (Intel) | 4.2 – CXL RAS learnings. Manjunaatha Harapanahalli, Erwin Tsaur and Mahesh Natu (Intel) |
| | 3.3 – Maintaining data integrity during transformation. Smita Kumar, Patrick Fleming, Gordon McFadden and Sailesh Bissessur (Intel) | 4.3 – Data Centers’ Reliability Risks due to Faults Affecting their High Performance Microprocessors’ Caches. Martin Omana, Annalisa Manfredi, Cecilia Metra, Riccardo Locatelli, Monia Chiavacci and Stefano Petrucci (U Bologna, Intel) |
| | 3.4 – Microarchitectural Modeling of Modern CPUs for SDCs Prediction in Data Centers. Dimitris Gizopoulos, George Papadimitriou and Odysseas Chatzopoulos (U Athens) | 4.4 – The Management Era: Predictive DRAM Fault Analysis with Architecture Awareness. Hoiju Chung, Yongjun Lee, Woongju Jang, Euisang Oh, Sanghwan Lee, Paul Fahey, Kijoong Choi, Arhatha Bramhanand and Brett Dodds (SK Hynix and Microsoft) |
| | 3.5 – The Challenges of Operating a Heterogeneous Edge Cluster. Nicolas Oliver, Rajkumar Patel, Dean Throop and Mrinal Karvir (Intel) | 4.5 – Managing Memory Correctable Error Solutions. Shawn Fan, Alex Zhou, Eric Li, Annie Yu, Taniya Siddiqua, Kaushik Balasubramanian, Fang Yuan and Xiaoguo Liang (Intel and Tencent) |
| 10:30 – 11:00 | Coffee Break | |
| 11:00 – 12:30 | Session 5 AI and RAS | Session 6 Testing and Resilience |
| | 5.1 – Meta AI Server Reliability Dimensional Analysis. Peng Xiao and Mihir Patel (Meta) | 6.1 – Delay Monitoring Under Different PVT Corners. Hari Addepalli, Jiezhong Wu, Nilanjan Mukherjee, Irith Pomeranz and Janusz Rajski (Purdue U and Siemens) |
| | 5.2 – Comprehensive Reliability Analysis in AI systems. Anju John, Matt Bergeron and Mihir Patel (Meta) | 6.2 – Timing-Verification Test for Timing Related Defects. Jiezhong Wu, Hari Addepalli, Nilanjan Mukherjee, Irith Pomeranz, Kun-Han Tsai and Janusz Rajski (Purdue U and Siemens) |
| | 5.3 – Build High Reliability/Availability/Serviceability head node for AI server. Alex Zhou, Yu Zhang, Chenchen Li, Fang Yuan, Shijian Ge, Albert Hu, Liang Peng, Antonio J Hasbun Marin and Shawn Fan (Intel and ByteDance) | 6.3 – A Functionally-Aware Scan-Based Test Solution for Silent Data Corruption. Irith Pomeranz and Yervant Zorian (Purdue U and Synopsys) |
| | 5.4 – PVF (Parameter Vulnerability Factor): A Scalable Metric to Quantify AI Vulnerability to Parameter Corruptions. Xun Jiao, Fred Lin and Harish Dixit (Meta) | 6.4 – ResGNN: A Generic Framework for Measuring Graph Neural Network Resilience Against Faults and Attacks in Hardware Systems. Hanqiu Chen, Zishen Wan and Cong Hao (Georgia Tech) |
| | 5.5 – What does measuring resilience in AI systems entail? Chitkala Sethuraman (Microsoft) | |
| | 5.6 – Dual Transformer Encoding: Remaining Useful Life Estimation through Channel-Independent and Collective Approach. Paul Nikolian and Fadi Kurdahi (UC Irvine) | |
| 12:30 – 14:00 | Networking Lunch | |

Timeslot | Topic | Moderator |
---|---|---|
7:00 – 8:00 | Registration and Breakfast | |
8:00 – 9:30 | 2 Parallel Tracks of Technical Presentations | Dimitris Gizopoulos |
9:30 – 11:00 | 2 Parallel Tracks of Technical Presentations | Amr Haggag |
11:00 – 11:30 | Coffee Break | |
11:30 – 13:00 | 2 Parallel Tracks of Technical Presentations | Kwabena Boateng |
13:00 – 14:00 | Networking Lunch | |