Underspecification Presents Challenges for Credibility in Modern Machine Learning

Alexander D'Amour, Katherine Heller, Dan Moldovan, Ben Adlam, Babak Alipanahi, Alex Beutel, Christina Chen, Jonathan Deaton, Jacob Eisenstein, Matthew D. Hoffman, Farhad Hormozdiari, Neil Houlsby, Shaobo Hou, Ghassen Jerfel, Alan Karthikesalingam, Mario Lucic, Yian Ma, Cory McLean, Diana Mincu, Akinori Mitani, Andrea Montanari, Zachary Nado, Vivek Natarajan, Christopher Nielson, Thomas F. Osborne, Rajiv Raman, Kim Ramasamy, Rory Sayres, Jessica Schrouff, Martin Seneviratne, Shannon Sequeira, Harini Suresh, Victor Veitch, Max Vladymyrov, Xuezhi Wang, Kellie Webster, Steve Yadlowsky, Taedong Yun, Xiaohua Zhai, D. Sculley

From arXiv. Updates: updated statistical analysis in Section 6; additional citations.

ML models often exhibit unexpectedly poor behavior when they are deployed in real-world domains. We identify underspecification as a key reason for these failures. An ML pipeline is underspecified when it can return many predictors with equivalently strong held-out performance in the training domain. Underspecification is common in modern ML pipelines, such as those based on deep learning. Predictors returned by underspecified pipelines are often treated as equivalent based on their training domain performance, but we show here that such predictors can behave very differently in deployment domains. This ambiguity can lead to instability and poor model behavior in practice, and is a distinct failure mode from previously identified issues arising from structural mismatch between training and deployment domains. We show that this problem appears in a wide variety of practical ML pipelines, using examples from computer vision, medical imaging, natural language processing, clinical risk prediction based on electronic health records, and medical genomics. Our results show the need to explicitly account for underspecification in modeling pipelines that are intended for real-world deployment in any domain.
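To make the definition concrete, here is a minimal toy sketch of underspecification. It is not taken from the paper's experiments. It assumes a linear model trained by gradient descent on data where a second feature exactly duplicates the first, so many weight vectors fit the training domain equally well. The helper names (`make_data`, `fit`, `mse`), the data-generating process, and the choice of shift are all hypothetical and chosen only for illustration.

```python
import numpy as np

def make_data(rng, n, correlated):
    """Toy data (an assumption for illustration): the target depends on x1 only."""
    x1 = rng.normal(size=n)
    # Training domain: x2 duplicates x1. Shifted domain: x2 is independent noise.
    x2 = x1.copy() if correlated else rng.normal(size=n)
    X = np.column_stack([x1, x2])
    y = x1
    return X, y

def fit(X, y, seed, steps=2000, lr=0.1):
    """Least squares by gradient descent; only the random initialization varies."""
    w = np.random.default_rng(seed).normal(size=X.shape[1])
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def mse(X, y, w):
    return float(np.mean((X @ w - y) ** 2))

data_rng = np.random.default_rng(0)
X_train, y_train = make_data(data_rng, 500, correlated=True)
X_heldout, y_heldout = make_data(data_rng, 500, correlated=True)
X_shift, y_shift = make_data(data_rng, 500, correlated=False)

# Same pipeline, same data, different seeds.
weights = [fit(X_train, y_train, seed) for seed in range(5)]
heldout_mse = [mse(X_heldout, y_heldout, w) for w in weights]
shift_mse = [mse(X_shift, y_shift, w) for w in weights]

for seed, (h, s) in enumerate(zip(heldout_mse, shift_mse)):
    print(f"seed={seed} held-out MSE={h:.2e} shifted MSE={s:.3f}")
```

In this sketch every seed reaches essentially the same held-out error in the training domain, because any weights that sum to one fit the duplicated feature equally well. Once the two features decouple, the errors diverge across seeds, which mirrors in miniature the gap between training-domain equivalence and deployment behavior that the abstract describes.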
