Many critical applications require neural networks with sub-microsecond inference latency. Targeting such applications deployed on FPGAs, we present High Granularity Quantization (HGQ), a quantization-aware training framework that optimizes parameter bit-widths through gradient descent. Unlike conventional methods, HGQ determines the optimal bit-width for each parameter independently, making it suitable for hardware platforms that support heterogeneous arbitrary-precision arithmetic. In our experiments, HGQ outperforms existing network compression methods, achieving orders-of-magnitude reductions in resource consumption and latency while maintaining accuracy on several benchmark tasks. These improvements enable the deployment of complex models that were previously infeasible due to resource or latency constraints. HGQ is open-source and is used in developing next-generation trigger systems for the CERN ATLAS and CMS particle physics experiments, enabling the use of advanced machine learning models for real-time data selection with sub-microsecond latency.
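The core idea of per-parameter bit-width optimization can be illustrated with a minimal sketch. This is not the HGQ implementation: the function names, the continuous `f_bits` relaxation, the straight-through weight update, and the per-bit penalty `beta` are all illustrative assumptions, showing only how a resource penalty in the loss can push each parameter's precision down independently via gradient descent.

```python
import numpy as np

def quantize(w, f_bits):
    """Round w onto a fixed-point grid with floor(f_bits) fractional bits.

    f_bits is kept continuous per parameter so a gradient can act on it;
    only its floor determines the actual quantization grid.
    """
    scale = 2.0 ** np.floor(f_bits)
    return np.round(w * scale) / scale

def train_step(w, f_bits, grad_w, beta=1e-2, lr=0.1):
    """One illustrative training step (hypothetical, not the HGQ API).

    Straight-through estimator: the task gradient grad_w passes through
    the quantizer unchanged to update w. Each f_bits entry independently
    receives a constant penalty gradient of size beta, so precision that
    the task loss does not defend is traded away per parameter -- the
    heterogeneous granularity the abstract refers to.
    """
    w_new = w - lr * grad_w
    f_new = np.maximum(f_bits - lr * beta, 0.0)  # bit-widths stay non-negative
    return w_new, f_new

# Each weight carries its own bit-width variable.
w = np.array([0.30, -0.72])
f_bits = np.array([3.0, 2.0])
w_q = quantize(w, f_bits)  # -> array([ 0.25, -0.75])
```

In a full framework the penalty term would estimate on-chip resource cost (e.g. LUTs or DSPs per bit) rather than a flat `beta`, but the mechanism, a differentiable surrogate for each parameter's precision, is the same.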