Machine learning is transitioning from an art and science into a technology available to every developer. In the near future, every application on every platform will incorporate trained models to encode data-based decisions that would be impossible for developers to author. This presents a significant engineering challenge, since data science and modeling are currently largely decoupled from standard software development processes. This separation makes incorporating machine learning capabilities into applications unnecessarily costly and difficult, and furthermore discourages developers from embracing ML in the first place. In this paper we present ML.NET, a framework developed at Microsoft over the last decade in response to the challenge of making it easy to ship machine learning models in large software applications. We present its architecture and illuminate the application demands that shaped it. Specifically, we introduce DataView, the core data abstraction of ML.NET, which allows it to capture full predictive pipelines efficiently and consistently across training and inference lifecycles. We close the paper with a surprisingly favorable performance study of ML.NET compared to more recent entrants, and a discussion of some lessons learned.