With the emergence of a spectrum of high-end mobile devices, many applications that formerly required desktop-level computation capability are being transferred to these devices. However, executing the inference of Deep Neural Networks (DNNs) remains challenging given their high computation and storage demands, especially when real-time performance with high accuracy is required. Weight pruning of DNNs has been proposed, but existing schemes represent two extremes in the design space: non-structured pruning is fine-grained and accurate but not hardware-friendly; structured pruning is coarse-grained and hardware-efficient but incurs higher accuracy loss. In this paper, we introduce a new dimension, fine-grained pruning patterns inside the coarse-grained structures, revealing a previously unknown point in the design space. With the higher accuracy enabled by fine-grained pruning patterns, the unique insight is to use the compiler to re-gain and guarantee high hardware efficiency. In other words, our method achieves the best of both worlds, and is desirable across the theory/algorithm, compiler, and hardware levels. The proposed PatDNN is an end-to-end framework that efficiently executes DNN inference on mobile devices with the help of a novel model-compression technique (pattern-based pruning based on an extended ADMM solution framework) and a set of thorough architecture-aware compiler- and code-generation-based optimizations (filter-kernel reordering, compressed weight storage, register-load redundancy elimination, and parameter auto-tuning). Evaluation results demonstrate that PatDNN outperforms three state-of-the-art end-to-end DNN frameworks, TensorFlow Lite, TVM, and Alibaba Mobile Neural Network, with speedups of up to 44.5x, 11.4x, and 7.1x, respectively, with no accuracy compromise. Real-time inference of representative large-scale DNNs (e.g., VGG-16, ResNet-50) can be achieved on mobile devices.
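To make the "fine-grained patterns inside coarse-grained structures" idea concrete, the following is a minimal illustrative sketch (not the paper's actual ADMM-based algorithm): each 3x3 convolution kernel is restricted to a small, hypothetical library of fixed binary patterns, and each kernel keeps the pattern that preserves the most weight magnitude. The pattern shapes and the magnitude-based selection rule here are assumptions for illustration only.

```python
# Sketch of pattern-based pruning over 3x3 kernels (flattened to 9 entries).
# A hypothetical library of four 4-entry patterns; 1 = keep weight, 0 = prune.
PATTERNS = [
    (1, 1, 0, 1, 1, 0, 0, 0, 0),  # upper-left 2x2 block
    (0, 1, 1, 0, 1, 1, 0, 0, 0),  # upper-right 2x2 block
    (0, 0, 0, 1, 1, 0, 1, 1, 0),  # lower-left 2x2 block
    (0, 0, 0, 0, 1, 1, 0, 1, 1),  # lower-right 2x2 block
]

def pattern_prune(kernel):
    """kernel: flat list of 9 weights -> (pruned kernel, chosen pattern index).

    Picks the pattern that retains the largest total absolute weight,
    then zeroes every entry outside that pattern.
    """
    best = max(range(len(PATTERNS)),
               key=lambda p: sum(abs(w)
                                 for w, m in zip(kernel, PATTERNS[p]) if m))
    pruned = [w * m for w, m in zip(kernel, PATTERNS[best])]
    return pruned, best

# Example: the kept entries are fine-grained (a pattern inside the kernel),
# but every kernel ends up with the same regular 4-of-9 sparsity, which a
# compiler can exploit for efficient code generation.
kernel = [0.9, -0.1, 0.2, 0.8, 0.7, 0.05, 0.3, -0.6, 0.1]
pruned, pid = pattern_prune(kernel)
print(pid, pruned)  # pattern 0 wins; 4 of 9 weights survive
```

Because every kernel conforms to one of a few known patterns, downstream optimizations such as filter-kernel reordering and compressed weight storage can treat the sparsity as regular rather than arbitrary.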