
Topic: Deep Learning Compiler

Overview:

Apache TVM is an open-source deep learning compiler stack for CPUs, GPUs, and specialized accelerators. It aims to close the gap between productivity-focused deep learning frameworks and performance- or efficiency-oriented hardware backends. This talk centers on the AWS AI team's deep learning compiler work, covering how to use pre-quantized models with TVM and how to add a new operator, either implemented entirely from scratch or lowered to a sequence of existing Relay operators.
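As a rough illustration of the "lowering" idea above, the following toy sketch in plain Python (not TVM's actual Relay API; all names here are hypothetical) shows how a new high-level operator can be expressed as a sequence of operators the compiler already supports:

```python
# Toy sketch of operator lowering: instead of implementing a new op from
# scratch, rewrite it as a composition of existing primitive ops
# (analogous to lowering a custom op to existing Relay operators).

# Primitive ops the "compiler" already knows how to execute.
PRIMITIVES = {
    "mul": lambda a, b: a * b,
    "add": lambda a, b: a + b,
    "max": lambda a, b: max(a, b),
}

def lower_leaky_relu(x, alpha=0.1):
    """A new op, leaky_relu(x) = max(x, alpha * x), lowered to primitives."""
    scaled = PRIMITIVES["mul"](alpha, x)
    return PRIMITIVES["max"](x, scaled)

print(lower_leaky_relu(2.0))   # 2.0
print(lower_leaky_relu(-2.0))  # -0.2
```

The real mechanism in TVM additionally involves registering the operator's type relation and compute/schedule, but the rewrite-to-existing-operators pattern is the same.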

Invited speaker:

Yida Wang is an applied scientist on the AWS AI team at Amazon. Before joining Amazon, he was a research scientist in the Parallel Computing Lab at Intel Labs. He received his PhD in computer science and neuroscience from Princeton University. His research interests are high-performance computing and big-data analytics. His current work focuses on optimizing the inference of deep learning models on different hardware architectures, such as CPUs, GPUs, and TPUs.


Latest content

Various hardware accelerators have been developed for energy-efficient and real-time inference of neural networks on edge devices. However, most training is done on high-performance GPUs or servers, and the huge memory and computing costs prevent training neural networks on edge devices. This paper proposes a novel tensor-based training framework, which offers orders-of-magnitude memory reduction in the training process. We propose a novel rank-adaptive tensorized neural network model, and design a hardware-friendly low-precision algorithm to train this model. We present an FPGA accelerator to demonstrate the benefits of this training method on edge devices. Our preliminary FPGA implementation achieves $59\times$ speedup and $123\times$ energy reduction compared to embedded CPU, and $292\times$ memory reduction over a standard full-size training.
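The orders-of-magnitude memory reduction described above comes from replacing full-size weight tensors with low-rank factors. A minimal back-of-the-envelope sketch of the parameter-count savings (illustrative sizes, not the paper's actual model):

```python
# Compare parameter counts: a full dense weight matrix vs. a rank-r
# factorization W ~= A @ B, the core idea behind the memory reduction
# in tensorized training.

def dense_params(m, n):
    """Parameters in a full m-by-n weight matrix."""
    return m * n

def low_rank_params(m, n, r):
    """Parameters in a rank-r factorization: A is m-by-r, B is r-by-n."""
    return m * r + r * n

full = dense_params(1024, 1024)             # 1,048,576 parameters
compressed = low_rank_params(1024, 1024, 8) # 16,384 parameters
print(full, compressed, full / compressed)  # 64x fewer parameters
```

Tensor-train-style decompositions generalize this two-factor split to chains of small cores, which is how much larger reductions (and the rank adaptivity the abstract mentions) become possible.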
