The growing disparity between computational power and on-chip communication bandwidth is a critical bottleneck in modern Systems-on-Chip (SoCs), especially for data-parallel workloads such as AI. Efficient point-to-multipoint (P2MP) data movement, such as multicast, is essential for high performance, yet standard interconnect protocols lack native multicast support. Existing P2MP solutions, such as multicast-capable Networks-on-Chip (NoCs), impose additional overhead on the network hardware and require modifications to the interconnect protocol, compromising scalability and compatibility. This paper introduces Torrent, a novel distributed DMA architecture that enables efficient P2MP data transfers without modifying the NoC hardware or the interconnect protocol. Torrent conducts P2MP data transfers by forming logical chains over the NoC: the data passes through the targeted destinations one after another, much like traversing a linked list. This Chainwrite mechanism preserves the point-to-point (P2P) nature of every data transfer while enabling flexible data transfers to an unlimited number of destinations. To optimize the performance and energy consumption of Chainwrite, two scheduling algorithms are developed to determine the optimal chain order based on the NoC topology. Our RTL and FPGA prototype evaluations, using both synthetic and real workloads, demonstrate significant advantages in performance, flexibility, and scalability over network-layer multicast. Compared to the unicast baseline, Torrent achieves a speedup of up to 7.88x. ASIC synthesis in a 16 nm technology confirms the architecture's minimal footprint in area (1.2%) and power (2.3%). Thanks to Chainwrite, Torrent delivers scalable P2MP data transfers with a small per-destination overhead of 82 clock cycles and 207 µm² of area.
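To make the chain-ordering idea concrete, the following is a minimal sketch, not the scheduling algorithms developed in the paper. It assumes a hypothetical 2D mesh NoC with dimension-ordered (XY) routing, uses hop count as a rough proxy for transfer cost, and orders destinations with a simple greedy nearest-neighbor heuristic. All names (`hops`, `greedy_chain`, `chain_hops`, `unicast_hops`) are illustrative.

```python
from typing import List, Tuple

# (x, y) coordinate of a node on a hypothetical 2D mesh NoC (an assumption).
Node = Tuple[int, int]


def hops(a: Node, b: Node) -> int:
    """Hop count between two nodes, assuming XY routing (Manhattan distance)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def greedy_chain(source: Node, destinations: List[Node]) -> List[Node]:
    """Order destinations by repeatedly moving to the nearest unvisited one.

    This is a generic nearest-neighbor heuristic used for illustration only;
    ties are broken by coordinate so the result is deterministic.
    """
    remaining = list(destinations)
    chain: List[Node] = []
    current = source
    while remaining:
        nxt = min(remaining, key=lambda d: (hops(current, d), d))
        remaining.remove(nxt)
        chain.append(nxt)
        current = nxt
    return chain


def chain_hops(source: Node, chain: List[Node]) -> int:
    """Total hops when data is forwarded source -> d1 -> d2 -> ... in order."""
    path = [source] + chain
    return sum(hops(a, b) for a, b in zip(path, path[1:]))


def unicast_hops(source: Node, destinations: List[Node]) -> int:
    """Total hops when the source sends a separate copy to every destination."""
    return sum(hops(source, d) for d in destinations)


if __name__ == "__main__":
    src = (0, 0)
    dsts = [(3, 3), (1, 0), (3, 0), (0, 2)]
    order = greedy_chain(src, dsts)
    print("chain order:", order)
    print("chain hops:", chain_hops(src, order))
    print("unicast hops:", unicast_hops(src, dsts))
```

Every link in the chain is still an ordinary P2P transfer, which is the property Chainwrite relies on. Hop count ignores link contention, per-destination forwarding latency, and energy, so this sketch only shows why the chain order matters, not how much speedup to expect.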