Dr. Donglai Dai is a Chief Engineer at X-ScaleSolutions and leads the company’s R&D team. His current work focuses on developing scalable, efficient communication libraries, checkpointing and restart libraries, and performance analysis tools for distributed and parallel HPC and deep learning applications on HPC systems and clouds. He has more than 20 years of industry experience in engineering management and in the development of computer systems, VLSI, IoT, and interconnection networks, gained at Intel, Cray, SGI, and startups. He holds more than 10 granted US patents and has published more than 30 technical papers and book chapters. He holds a Ph.D. in computer science from The Ohio State University.
This talk will focus on high-performance and scalable middleware for MPI and DL applications on the OpenPOWER platform. It will cover three products for which commercial support is available from X-ScaleSolutions. The first product is based on the OSU MVAPICH2 MPI libraries and their capabilities for high-performance computing with both CPUs (OpenPOWER) and GPUs (NVIDIA). The second product tightly integrates the OSU MVAPICH2-GDR MPI library with the Horovod stack to provide high-performance and scalable Deep Learning (DL) with deep introspection (DI) capabilities for DL frameworks such as TensorFlow and PyTorch. The DI capabilities allow DL users and runtime developers to easily optimize their DL applications on modern systems. The third product is a high-performance and scalable checkpointing library for HPC and DL applications. Performance results from the ORNL Summit system (ranked 2nd) and Lassen (ranked 20th), with thousands of GPUs and POWER9 CPUs, will be presented.