This study evaluates AoS-to-SoA transformations onto reduced-precision data layouts for a particle simulation code on several GPU platforms. We hypothesize that SoA is particularly well suited to SIMT execution, whereas AoS is the preferred storage format for many Lagrangian codes. Reduced precision (below IEEE working precision) is an established tool to address bandwidth constraints, yet it remains unclear whether the AoS and precision conversions should execute on a CPU or be deployed to a GPU when the compute kernel itself must run on an accelerator. On modern superchips, where CPUs and GPUs share (logically) one data space, it is also unclear whether it is advantageous to stream data to the accelerator prior to the calculation, or whether we should let the accelerator transform data on demand, i.e.~work logically in-place. We therefore introduce compiler annotations that facilitate such conversions and give the programmer the option to orchestrate them in combination with GPU offloading. For some of our compute kernels of interest, Nvidia's GH200 platforms yield a speedup of around 2.6, while AMD's MI300A exhibits more robust performance yet profits less. We expect that our compiler-based techniques are applicable to a wide variety of Lagrangian codes and beyond.
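To make the targeted transformation concrete, the following minimal C++ sketch contrasts an AoS particle record in IEEE double precision with an SoA mirror held in reduced precision, together with a hand-written conversion of the kind that the proposed compiler annotations are meant to generate and orchestrate. The types \texttt{Particle} and \texttt{ParticlesSoA} and the function \texttt{convert} are illustrative assumptions and do not reproduce the paper's actual annotation syntax.

\begin{verbatim}
#include <cstddef>
#include <vector>

// AoS: one record per particle, full IEEE double precision.
// This is the natural storage format for many Lagrangian codes.
struct Particle {
  double x, y, z;     // position
  double vx, vy, vz;  // velocity
};

// SoA: one contiguous array per attribute, stored in reduced precision
// (here float; half precision would be a further step), so that
// coalesced SIMT loads move fewer bytes.
struct ParticlesSoA {
  std::vector<float> x, y, z;
  std::vector<float> vx, vy, vz;
};

// Hand-written AoS-to-SoA and precision conversion. Such a routine could
// run on the CPU before offloading, or on the GPU on demand; the paper's
// compiler annotations are intended to remove the need to write and
// schedule it manually.
inline void convert(const std::vector<Particle>& aos, ParticlesSoA& soa) {
  const std::size_t n = aos.size();
  soa.x.resize(n);  soa.y.resize(n);  soa.z.resize(n);
  soa.vx.resize(n); soa.vy.resize(n); soa.vz.resize(n);
  for (std::size_t i = 0; i < n; ++i) {
    soa.x[i]  = static_cast<float>(aos[i].x);
    soa.y[i]  = static_cast<float>(aos[i].y);
    soa.z[i]  = static_cast<float>(aos[i].z);
    soa.vx[i] = static_cast<float>(aos[i].vx);
    soa.vy[i] = static_cast<float>(aos[i].vy);
    soa.vz[i] = static_cast<float>(aos[i].vz);
  }
}
\end{verbatim}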