Neuromorphic accelerators offer promising platforms for machine learning (ML) inference by leveraging event-driven, spatially expanded architectures that naturally exploit unstructured sparsity through co-located memory and compute. However, their unique architectural characteristics create performance dynamics that differ fundamentally from those of conventional accelerators. Existing workload optimization approaches for neuromorphic accelerators rely on aggregate network-wide sparsity and operation counting, but the extent to which these metrics actually improve deployed performance remains unknown. This paper presents the first comprehensive performance bound and bottleneck analysis of neuromorphic accelerators, revealing the shortcomings of the conventional metrics and offering an understanding of which facets matter for workload performance. We present both theoretical analytical modeling and extensive empirical characterization of three real neuromorphic accelerators: Brainchip AKD1000, Synsense Speck, and Intel Loihi 2. From these, we establish three distinct accelerator bottleneck states: memory-bound, compute-bound, and traffic-bound, and identify which workload configuration features are likely to exhibit each. We synthesize all of our insights into the floorline performance model, a visual model that identifies performance bounds and informs how to optimize a given workload based on its position within the model. Finally, we present an optimization methodology that combines sparsity-aware training with floorline-informed partitioning. Our methodology achieves substantial performance improvements at iso-accuracy: up to 3.86x runtime improvement and 3.38x energy reduction compared to prior manually-tuned configurations.