Machine learning models are becoming the primary workhorses for many applications. Production services deploy models through prediction serving systems that take in queries and return predictions by performing inference on machine learning models. In order to scale to high query rates, prediction serving systems are run on many machines in cluster settings, and thus are prone to slowdowns and failures that inflate tail latency and cause violations of strict latency targets. Current approaches to reducing tail latency are inadequate for the latency targets of prediction serving, incur high resource overhead, or are inapplicable to the computations performed during inference. We present ParM, a novel, general framework for making use of ideas from erasure coding and machine learning to achieve low-latency, resource-efficient resilience to slowdowns and failures in prediction serving systems. ParM encodes multiple queries together into a single parity query and performs inference on the parity query using a parity model. A decoder uses the output of a parity model to reconstruct approximations of unavailable predictions. ParM uses neural networks to learn parity models that enable simple, fast encoders and decoders to reconstruct unavailable predictions for a variety of inference tasks such as image classification, speech recognition, and object localization. We build ParM atop an open-source prediction serving system and through extensive evaluation show that ParM improves overall accuracy in the face of unavailability with low latency while using 2-4$\times$ less additional resources than replication-based approaches. ParM reduces the gap between 99.9th percentile and median latency by up to $3.5\times$ compared to approaches that use an equal amount of resources, while maintaining the same median.
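The coding scheme described above can be sketched concretely. The snippet below is an illustrative toy, not ParM's implementation: it assumes a summation encoder and subtraction decoder over k=2 queries, and uses a linear model so that the parity model can be exact (ParM instead *learns* a neural parity model, since real inference is nonlinear). All names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "deployed model": a linear map, so F(x1 + x2) = F(x1) + F(x2)
# holds exactly. For nonlinear models, ParM trains a separate neural
# network to approximate this property.
W = rng.standard_normal((4, 3))

def model(x):
    """Inference on a single query."""
    return W @ x

def parity_model(p):
    """Here identical to the deployed model, valid only because it is linear."""
    return W @ p

x1, x2 = rng.standard_normal(3), rng.standard_normal(3)

# Encoder: combine k=2 queries into one parity query.
p = x1 + x2

# Suppose the server computing x2's prediction is slow or has failed;
# only x1's prediction and the parity prediction are available.
y1 = model(x1)
y_parity = parity_model(p)

# Decoder: reconstruct (here, exactly) the unavailable prediction.
y2_hat = y_parity - y1
```

With a learned parity model, `y2_hat` would be an approximation rather than exact, which is why ParM targets tasks like classification where approximate reconstructed predictions are still useful.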