{"id":218,"date":"2024-05-28T02:20:01","date_gmt":"2024-05-28T02:20:01","guid":{"rendered":"https:\/\/ieee-ras.conferences.computer.org\/2024\/?page_id=218"},"modified":"2024-05-28T02:26:00","modified_gmt":"2024-05-28T02:26:00","slug":"invited_talk_arijit_biswas_abstract","status":"publish","type":"page","link":"https:\/\/ieee-ras.conferences.computer.org\/2024\/invited_talk_arijit_biswas_abstract\/","title":{"rendered":"Invited_talk_Arijit_Biswas_Abstract"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-page\" data-elementor-id=\"218\" class=\"elementor elementor-218\" data-elementor-post-type=\"page\">\n\t\t\t\t<div class=\"elementor-element elementor-element-730aa2a e-flex e-con-boxed e-con e-parent\" data-id=\"730aa2a\" data-element_type=\"container\" data-e-type=\"container\">\n\t\t\t\t\t<div class=\"e-con-inner\">\n\t\t\t\t<div class=\"elementor-element elementor-element-4db0d21 elementor-widget elementor-widget-text-editor\" data-id=\"4db0d21\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Title:\u00a0<b>Pathfinding Toward a Self-Healing Architecture in Data Centers<\/b><\/p><p class=\"x_MsoNormal\">Speaker:\u00a0<strong>Arijit Biswas\u00a0<\/strong><\/p><p class=\"x_MsoNormal\"><strong>Abstract:<\/strong><\/p><p>Multi-process, chiplet-based System-on-Chip datacenter architectures are dominating xPU designs and integrate various parallel processing accelerators with more traditional execution pipelines.\u00a0 Due to the very high complexity of such SoCs and the associated costs to manufacture, package, and assure their quality, new architectures will require novel RAS approaches that are cost-effective, adaptable and reconfigurable for diversified target workloads and markets such as AI\/ML, HPC,\u00a0General Computing, Low-Latency Processing, Graphics, Cloud and Edge, Communications and Embedded.\u00a0 Additionally, system 
level integration of such functions presents both challenges and opportunities to optimize and target the solution space based on usages &amp; workloads at the system, node, rack or data center levels to enable better total cost of ownership and value customizations.<\/p><p>Continuous improvement in the quality &amp; reliability space includes various improvements across manufacturing, test &amp; product development.\u00a0 From a technology perspective, this involves new technologies and methods for fault detection, system diagnostics &amp; error recovery.\u00a0 This level of reliability is a critical requirement to ensure that a wide variety of usages of such products, ranging from contractual requirements to life safety and even ensuring security, are successful. This means that a wide spectrum of reliability options will dictate solution spaces where redundancy cost, time-to-market and sustainability are the primary factors driving those choices and solutions.\u00a0\u00a0<\/p><p>High-reliability computing architectures have traditionally focused on some aspect of reliability, availability, reconfigurability, diagnostics, prognostics, and various related verification &amp; validation processes\/methodologies (simulation\/formal) as ways to assure continuation of computing services despite internal SoC errors or failures that affect some part of SoC logic.\u00a0 In our methodology, we take a different direction by first establishing the key pillars of a self-healing architecture, describing their attributes, and then following with individual methods to support those pillars.\u00a0<\/p><p>Our methodology proposes (run-time) detection, seamless diagnostics, and recovery\/failover as the three key pillars of the self-healing architecture \u2013 all bound together with clear interfaces and configurability so that they operate as an automated virtuous cycle that detects, diagnoses and recovers from various faults in the field without needing customer 
intervention.\u00a0 Detection incorporates both run-time detection capabilities and coverage and error containment.\u00a0 Diagnostics include the ability to run high-coverage test content, including stress content, either at boot time or seamlessly during run time \u2013 triggered either by detection or the user \u2013 at a granularity that matches the available recovery mechanisms.\u00a0 Finally, recovery incorporates a variety of mechanisms across a broad range of the data center system stack to provide the right level of recovery for high uptime and availability.<\/p><p>Further, we determine that such an architecture must provide configurability, via a set of unified control interfaces that allow (seamless) adjustment of system RAS capabilities by users, and scalability, to allow substantial change in support of partial or full RAS capabilities based on the user\u2019s needs, where the primary factors are the key usages, the size of datacenter systems (rack, module, blade) and their required protection granularity.\u00a0 It is important to note that in a self-healing architecture, support for multiple pillars is required, and those pillars can be linked.\u00a0<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t","protected":false},"excerpt":{"rendered":"<p>Title:\u00a0Pathfinding Toward a Self-Healing Architecture in Data Centers Speaker:\u00a0Arijit Biswas\u00a0 Abstract: Multi-process, chiplet-based System-on-Chip datacenter architectures are dominating xPU designs and integrate various parallel processing accelerators with more traditional execution pipelines.\u00a0 Due to the very high complexity of such SoCs and the associated costs to manufacture, package, and assure their quality, new architectures will require novel 
[&hellip;]<\/p>\n","protected":false},"author":4,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"elementor_canvas","meta":{"footnotes":""},"class_list":["post-218","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/ieee-ras.conferences.computer.org\/2024\/wp-json\/wp\/v2\/pages\/218","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ieee-ras.conferences.computer.org\/2024\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/ieee-ras.conferences.computer.org\/2024\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/ieee-ras.conferences.computer.org\/2024\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/ieee-ras.conferences.computer.org\/2024\/wp-json\/wp\/v2\/comments?post=218"}],"version-history":[{"count":0,"href":"https:\/\/ieee-ras.conferences.computer.org\/2024\/wp-json\/wp\/v2\/pages\/218\/revisions"}],"wp:attachment":[{"href":"https:\/\/ieee-ras.conferences.computer.org\/2024\/wp-json\/wp\/v2\/media?parent=218"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}