GPU-accelerated computing is a key technology for realizing high-speed inference servers based on deep neural networks (DNNs). An important characteristic of GPU-based inference is that its computational efficiency, in terms of both processing speed and energy consumption, increases drastically when multiple jobs are processed together in a batch. In this paper, we formulate GPU-based inference servers as a batch service queueing model with batch-size dependent processing times. We first show that the energy efficiency of the server increases monotonically with the arrival rate of inference jobs, which suggests that it is energy-efficient to operate the inference server at as high a utilization level as possible within the latency requirement of inference jobs. We then derive a closed-form upper bound for the mean latency, which provides a simple characterization of the latency performance. Through simulation and numerical experiments, we show that the exact value of the mean latency is well approximated by this upper bound. We further compare this upper bound with the latency curve measured in a real implementation of GPU-based inference servers, and we show that the measured performance curve is well explained by the derived formula.
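To make the queueing model described above concrete, the following is a minimal discrete-event simulation sketch of a batch-service queue with batch-size dependent processing times: jobs arrive according to a Poisson process, and the server repeatedly gathers all waiting jobs (up to an assumed maximum batch size) into one batch whose processing time grows with the batch size. The arrival rates, the maximum batch size, and the affine processing-time model are illustrative assumptions, not values or formulas from the paper.

```python
# Sketch of a batch-service queue with batch-size dependent processing times.
# All numerical parameters below are assumptions chosen only for illustration.
import random


def simulate(lam=8.0, max_batch=32, setup=0.010, per_job=0.0005,
             num_jobs=200_000, seed=0):
    """Estimate the mean latency (waiting + processing) of a batch-service queue.

    lam       : arrival rate of inference jobs (jobs per unit time)
    max_batch : maximum batch size the server processes at once (assumed)
    setup     : fixed part of the batch processing time (assumed)
    per_job   : incremental processing time per job in the batch (assumed)
    """
    rng = random.Random(seed)

    # Generate Poisson arrival times.
    arrivals, t = [], 0.0
    for _ in range(num_jobs):
        t += rng.expovariate(lam)
        arrivals.append(t)

    server_free = 0.0        # time at which the server next becomes idle
    total_latency = 0.0
    i = 0
    while i < num_jobs:
        # A batch starts when the server is free and at least one job is present.
        start = max(server_free, arrivals[i])
        # Collect every job that has arrived by the start time, up to max_batch.
        j = i
        while j < num_jobs and j - i < max_batch and arrivals[j] <= start:
            j += 1
        batch = arrivals[i:j]
        # Batch-size dependent processing time (assumed affine in the batch size).
        proc = setup + per_job * len(batch)
        finish = start + proc
        total_latency += sum(finish - a for a in batch)
        server_free = finish
        i = j
    return total_latency / num_jobs


if __name__ == "__main__":
    for rate in (2.0, 8.0, 32.0, 128.0):
        print(f"arrival rate {rate:6.1f}: mean latency {simulate(lam=rate):.4f}")
```

Under this assumed affine processing-time model, the per-job processing time shrinks as batches grow, so higher arrival rates lead to larger average batches and better per-job efficiency, which is consistent with the qualitative behavior stated in the abstract.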