High performance and energy efficient inference for deep learning on ARM processors

We evolve PyDTNN, a framework for distributed parallel training of Deep Neural Networks (DNNs), into an efficient inference tool for convolutional neural networks. Our optimization process on multicore ARM processors involves several high-level transformations of the original framework: the development and integration of Cython routines to exploit thread-level parallelism; the design and development of micro-kernels for matrix multiplication, vectorized with ARM NEON intrinsics, that can accommodate layer fusion; and the selection of several cache configuration parameters tailored to the memory hierarchy of the target ARM processors. Our experiments evaluate inference throughput (measured in processed images/s), inference latency (i.e., time-to-response), and energy consumption per image while varying the level of thread parallelism and the processor power mode. We report results for the new inference engine on the ResNet50 v1.5 model with the ImageNet dataset from the MLPerf suite, using the ARMv8.2 cores of the NVIDIA Jetson AGX Xavier board. These results show superior performance compared with the widely used TFLite from Google, and slightly inferior results compared with ArmNN, ARM's native library for DNN inference.
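The ideas of cache blocking and layer fusion mentioned above can be illustrated with a minimal sketch. This is not PyDTNN's code: the function name `blocked_gemm_bias_relu` and the block sizes `mc`, `kc`, `nc` are hypothetical, plain Python loops stand in for a Cython/NEON micro-kernel, and bias plus ReLU is assumed as the fused operation purely for illustration.

```python
def blocked_gemm_bias_relu(A, B, bias, mc=4, kc=4, nc=4):
    """Compute relu(A @ B + bias) tile by tile.

    A is m x k and B is k x n (lists of lists); bias has length n.
    mc, kc, nc are hypothetical cache-blocking parameters; in a real
    engine they would be tuned to the cache sizes of the target core.
    """
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for jc in range(0, n, nc):              # block of columns of B and C
        for pc in range(0, k, kc):          # block of the shared dimension
            last_k_block = pc + kc >= k
            for ic in range(0, m, mc):      # block of rows of A and C
                # Stand-in for the micro-kernel: update one tile of C.
                for i in range(ic, min(ic + mc, m)):
                    for j in range(jc, min(jc + nc, n)):
                        acc = C[i][j]
                        for p in range(pc, min(pc + kc, k)):
                            acc += A[i][p] * B[p][j]
                        if last_k_block:
                            # Fused epilogue: apply bias and ReLU while the
                            # tile is being written, not in a separate pass.
                            acc = max(acc + bias[j], 0.0)
                        C[i][j] = acc
    return C
```

Fusing the bias and activation into the final update of each tile avoids a second sweep over the output matrix, which is the kind of memory traffic saving that layer fusion aims for. The result does not depend on the block sizes; only the memory access pattern changes.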
