Deep learning models have been used to support analytics beyond simple aggregation, where deeper and wider models have been shown to yield better accuracy. These models demand substantial memory and computation. However, most large-scale industrial applications operate under tight computational budgets. In practice, the peak workload of an inference service can be 10x the average, with unpredictable extreme spikes. Computational resources are wasted during off-peak hours, and the system may crash when the workload exceeds its capacity. How to support deep learning services under dynamic workloads cost-efficiently therefore remains a challenging problem. In this paper, we address the challenge with a general and novel training scheme called model slicing, which enables deep learning models to provide predictions within a dynamically prescribed computational budget. Model slicing can be viewed as an elastic-computation solution that requires no additional computational resources. Specifically, each layer of the model is divided into contiguous groups of basic components (i.e., neurons in dense layers and channels in convolutional layers), and a partial order is imposed on these groups by requiring that every forward pass activate a contiguous run of groups, from the first group up to a dynamically determined rightmost group. During training, the rightmost group is indexed dynamically by a single parameter, the slice rate, which drives the network to build up group-wise, residual representations. At inference time, a sub-model with fewer groups can then be deployed directly for efficiency, with computational cost roughly quadratic in the width controlled by the slice rate. Extensive experiments show that models trained with model slicing can effectively support on-demand workloads at elastic inference cost.
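To make the grouping scheme concrete, below is a minimal sketch in PyTorch of a sliced dense layer. The class SlicedLinear, the num_groups parameter, and the slice_rate forward argument are illustrative assumptions, not the paper's implementation; training-time details such as how slice rates are sampled and how group-wise representations are normalized are omitted.

```python
# A minimal sketch of the model-slicing idea, assuming PyTorch.
# SlicedLinear and its interface are hypothetical illustrations,
# not the authors' released code.
import torch
import torch.nn as nn


class SlicedLinear(nn.Module):
    """Dense layer whose output neurons form contiguous groups.

    Every forward pass uses groups 0..k-1, where k is derived from the
    slice rate r in (0, 1]. Because both the input width (inherited from
    a preceding sliced layer) and the output width shrink with r, the
    matmul cost is roughly quadratic in the slice rate.
    """

    def __init__(self, in_features, out_features, num_groups=4):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.num_groups = num_groups
        self.group_size = out_features // num_groups

    def forward(self, x, slice_rate=1.0):
        # Rightmost active group index, set dynamically by the slice rate.
        k = max(1, round(slice_rate * self.num_groups))
        out_dim = k * self.group_size
        # Use only the first k output groups, and only as many input
        # columns as the (possibly sliced) input actually provides.
        w = self.linear.weight[:out_dim, : x.size(-1)]
        b = self.linear.bias[:out_dim]
        return nn.functional.linear(x, w, b)


if __name__ == "__main__":
    l1 = SlicedLinear(64, 32, num_groups=4)
    l2 = SlicedLinear(32, 32, num_groups=4)
    x = torch.randn(8, 64)
    full = l2(l1(x, slice_rate=1.0), slice_rate=1.0)  # all groups, width 32
    half = l2(l1(x, slice_rate=0.5), slice_rate=0.5)  # first 2 groups, width 16
    print(full.shape, half.shape)  # torch.Size([8, 32]) torch.Size([8, 16])
```

At slice rate 0.5, the second layer's weight slice is 16x16 instead of 32x32, so its multiply-accumulate count drops to about a quarter of the full model's, matching the roughly quadratic cost reduction described above.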