HPDC'22 Keynote Speakers
Keynote I: HPDC Achievement Award: Franck Cappello (Argonne National Lab)
9:30 - 10:30 am, June 27, 2022 - Johnson Great Room
A Reflection on Methodologies, Algorithms, and Software for HPDC
The core of HPDC research and its main scientific contributions push forward the state of the art in how to program and execute parallel and distributed applications and analysis faster, more efficiently, more reliably, and more securely. Researchers in our domain face two important challenges when transforming an idea into an innovative algorithm and an innovative algorithm into successful software: evaluation methodology and software engineering. These challenges arise, in particular, every time a new branch of HPDC research emerges. In this talk, I will discuss the relation between methodology, algorithms, and software for HPDC. I will specifically focus on three domains: experimental platforms for HPDC research, fault tolerance at extreme scale, and lossy compression for scientific data. For these three domains I will review the motivations behind the research, the current situation, and what I see as the potential next directions.
Bio: Franck Cappello is a senior computer scientist and R&D lead at Argonne National Laboratory. Franck initiated the Grid’5000 project (https://www.grid5000.fr) in 2003. Grid’5000 is a large-scale experimental platform for high performance distributed computing research. He served as Director of Grid’5000 in its design, implementation and production phases from 2003 to 2008. Grid’5000 is still in use today and has helped hundreds of researchers with their experiments in parallel and distributed computing. Franck started working on fault tolerance for high performance scientific computing more than 20 years ago. With his students and collaborators, he explored many different and complementary aspects of resilience and fault tolerance for scientific computing at extreme scale, producing about 100 international publications in this domain. He led the resilience topic of the International Exascale Software Project and the European Exascale Software Initiative. Franck leads VeloC, an innovative asynchronous multi-level checkpointing project funded by the U.S. Exascale Computing Project (ECP) that will serve applications running on the U.S. exascale systems. Starting in 2016, with the support of ECP, he began exploring lossy compression for scientific computing to address the increasing discrepancy between scientific application data set sizes and the capacities of HPC storage infrastructures. This research produced the SZ lossy compressor, the Z-checker tool to assess the nature of lossy compression errors, and the SDRbench repository of reference scientific datasets. Franck is an IEEE Fellow and the recipient of two prestigious R&D100 awards, the 2018 IEEE TCPP Outstanding Service Award, and the 2021 IEEE Transactions on Computers Award for Editorial Service and Excellence.
Keynote II: Sudhanva Gurumurthi (AMD)
9:00 - 10:00 am, June 28, 2022 - Johnson Great Room
Heterogeneous Systems Resilience: From Research to Industry Standards
Reliability is a fundamental computing abstraction. This abstraction is increasingly challenging to achieve at high node-level component densities and for large compute infrastructures. Industry standards have played a key role in enabling such scaling by facilitating greater heterogeneity and tighter integration of compute and memory, and by paving the way for new node and system architectures. Therefore, Reliability, Availability, and Serviceability (RAS) techniques that enhance resilience and intercept major industry standards are beneficial to the overall ecosystem that uses these standards.
We first explain why RAS is important for large-scale systems and outline some key best practices in servers. We then present insights from studying reliability field data from production systems and provide an overview of tools and techniques developed to enhance resiliency and reliability. Finally, we show how the research influenced the RAS architecture and capabilities of two recently announced industry standards and their potential resilience benefits at scale.
Bio: Sudhanva Gurumurthi is a Principal Member of the Technical Staff at AMD, where he leads advanced development in RAS. Additionally, he serves on the Dean’s Advisory Council of the College of Science and Engineering at Texas State University. Prior to joining industry, Sudhanva was an Associate Professor with tenure in the Computer Science Department at the University of Virginia. He is a recipient of an NSF CAREER Award, a Google Focused Research Award, an IEEE Computer Society Distinguished Contributor recognition, and several other awards and recognitions. Sudhanva received his PhD in Computer Science and Engineering from Penn State in 2005.
Keynote III: Manish Parashar (University of Utah)
9:00 - 10:00 am, June 29, 2022 - Johnson Great Room
Data-Management for Extreme Science: Experiences in Translational Computer Science Research
Extreme-scale computing and data have become essential to computational and data-enabled science and engineering, promising dramatic new insights into natural and engineered systems. However, data-management challenges continue to limit the potential impact of extreme-scale application workflows. In this talk I will present my translational computer science (TCS) research experiences in addressing these challenges. TCS research bridges foundational/use-inspired research with the delivery and deployment of outcomes to the target community and supports the essential bidirectional interplay between them. Specifically, I will explore data-sharing abstractions, managed data pipelines, data-staging services, and in-situ/in-transit data placement and processing to support extreme-scale in-situ workflows.
Bio: Manish Parashar is Director of the Scientific Computing and Imaging (SCI) Institute, Chair in Computational Science and Engineering, and Professor, School of Computing at the University of Utah. He is currently on an IPA appointment at the National Science Foundation, where he is serving as Office Director of the NSF Office of Advanced Cyberinfrastructure. His research interests are in the broad areas of Parallel and Distributed Computing and Computational and Data-Enabled Science and Engineering. Manish is the founding chair of the IEEE Technical Consortium on High Performance Computing (TCHPC) and Editor-in-Chief of the IEEE Transactions on Parallel and Distributed Systems. He is a Fellow of AAAS, ACM, and IEEE/IEEE Computer Society. For more information, please visit http://manishparashar.org.