There is a growing need to investigate how machine learning models operate. With this work, we aim to understand trained machine learning models by questioning their data preferences. We propose a mathematical framework that allows us to probe trained models and identify their preferred samples in various scenarios, including prediction-risky, parameter-sensitive, and model-contrastive samples. To showcase our framework, we pose these queries to a variety of models trained on classification and regression tasks, and receive answers in the form of generated data.
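As a minimal sketch of the idea of querying a trained model for its "preferred" samples, the toy below probes a trained binary classifier for a prediction-risky sample: an input on which the model is maximally uncertain. This is not the paper's actual framework; the toy logistic model, the synthetic data, and the boundary-projection query are all illustrative assumptions. For a linear binary classifier, predictive entropy is maximized exactly where the logit is zero, so the query reduces to projecting a point onto the decision boundary.

```python
import numpy as np

# Illustrative toy, not the paper's framework: probe a trained binary
# classifier for a "prediction-risky" sample, i.e. an input where the
# model's predictive entropy is maximal (p = 0.5 for binary outputs).

rng = np.random.default_rng(0)

# Synthetic data: two Gaussian blobs, one per class.
X = np.vstack([rng.normal(-2.0, 1.0, (100, 2)),
               rng.normal(2.0, 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

# Train a logistic regression by plain gradient descent.
w, b = np.zeros(2), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    g = p - y                      # gradient of the log-loss w.r.t. logits
    w -= 0.1 * (X.T @ g) / len(y)
    b -= 0.1 * g.mean()

# Query the trained model: starting from a confidently classified point,
# generate the nearest input with zero logit (maximum predictive entropy)
# by projecting onto the hyperplane w @ x + b = 0.
x_start = np.array([2.5, 2.5])
x_risky = x_start - ((x_start @ w + b) / (w @ w)) * w

p_risky = 1.0 / (1.0 + np.exp(-(x_risky @ w + b)))
print(round(float(p_risky), 2))  # 0.5: the model is maximally uncertain here
```

The closed-form projection works only because the toy model is linear; for a deep network, an analogous query would need iterative gradient ascent on the predictive entropy instead.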