Implementing Multi-GPU Scientific Computing Miniapps Across Performance Portable Frameworks

Scientific computing in the exascale era demands ever more computational power to solve complex problems across many domains. The rise of heterogeneous computing architectures has highlighted the need for vendor-agnostic performance portability frameworks. Libraries such as Kokkos have become essential for enabling high-performance computing applications to run efficiently on different hardware platforms with minimal code changes. In this direction, this paper presents preliminary time-to-solution results for two representative scientific computing applications: an N-body simulation and a structured grid simulation. Both applications use a distributed-memory approach and hardware acceleration through four performance portability frameworks: Kokkos, OpenMP, RAJA, and OCCA. Experiments conducted on a single node of the Polaris supercomputer using four NVIDIA A100 GPUs revealed significant performance variability among the frameworks. OCCA demonstrated faster execution times for small-scale validation problems, likely due to JIT compilation; however, its lack of optimized reduction algorithms may limit scalability for larger simulations when using its out-of-the-box API. OpenMP performed poorly in the structured grid simulation, most likely due to inefficiencies in inter-node data synchronization and communication. These findings highlight the need for further optimization to maximize each framework's capabilities. Future work will focus on enhancing reduction algorithms, data communication, and memory management, as well as performing scalability studies and a comprehensive statistical analysis to evaluate and compare framework performance.
