MANA-2.0: A Future-Proof Design for Transparent Checkpointing of MPI at Scale

MANA-2.0 is a scalable, future-proof design for transparent checkpointing of MPI-based computations. Because it is network-agnostic, MANA-2.0 is intended to provide a viable, efficient mechanism for transparently checkpointing MPI applications on current and future supercomputers. MANA-2.0 enhances earlier work, the original MANA, which interposes on MPI calls; it is a work in progress intended for production deployment. MANA-2.0 implements a series of new algorithms and features that improve MANA's scalability and reliability, enabling transparent checkpoint-restart over thousands of MPI processes. MANA-2.0 is being tested on the Cori supercomputer at NERSC using the Cray MPICH library over the Cray GNI network, but it is designed to work with any standard MPI implementation running over an arbitrary network. Two widely used HPC applications were selected to demonstrate the enhanced features of MANA-2.0: GROMACS, a molecular dynamics simulation code with frequent point-to-point communication, and VASP, a materials science code with frequent MPI collective communication. Perhaps the most important lesson to be learned from MANA-2.0 is its series of algorithms and data structures for library-based transformations, which enable MPI-based computations running over MANA-2.0 to reliably survive the checkpoint-restart transition.
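To make the idea of interposing on library calls concrete, here is a minimal sketch in Python. It is not MANA-2.0 code and does not use MPI: `LoopbackBackend`, `InterposedComm`, and `in_flight` are hypothetical names invented for illustration, and counting sent and received messages is only an assumed example of the bookkeeping a wrapper layer could keep. The point is that the wrapper relies only on the backend's `send`/`recv` interface, so it does not depend on how the backend moves data.

```python
from collections import defaultdict


class LoopbackBackend:
    """Hypothetical stand-in for a real message-passing library."""

    def __init__(self):
        # One FIFO queue of pending payloads per destination rank.
        self._queues = defaultdict(list)

    def send(self, dest, payload):
        self._queues[dest].append(payload)

    def recv(self, rank):
        return self._queues[rank].pop(0)


class InterposedComm:
    """Wrapper exposing the same send/recv calls, plus bookkeeping.

    The application talks to this object instead of the backend, so
    every call passes through the wrapper before reaching the library.
    """

    def __init__(self, backend):
        self._backend = backend
        self.sent = 0
        self.received = 0

    def send(self, dest, payload):
        self.sent += 1
        self._backend.send(dest, payload)

    def recv(self, rank):
        payload = self._backend.recv(rank)
        self.received += 1
        return payload

    def in_flight(self):
        # Messages sent through the wrapper but not yet received.
        return self.sent - self.received


comm = InterposedComm(LoopbackBackend())
comm.send(1, "a")
comm.send(1, "b")
first = comm.recv(1)
pending = comm.in_flight()
```

In this toy run, one of the two messages has been received, so `pending` is 1. A wrapper of this kind could use such a count to decide when no messages remain outstanding, which is one assumed use and not a description of MANA-2.0's algorithms.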

