Data Representativity for Machine Learning and AI Systems

Data representativity is crucial when drawing inferences from data through machine learning models. Scholars have increasingly focused on unraveling bias and fairness in models, including in relation to inherent biases in the input data. However, limited work exists on the representativity of samples (datasets) for appropriate inference in AI systems. This paper reviews definitions and notions of a representative sample and surveys their use in the scientific AI literature. We introduce three measurable concepts to help focus the notions and evaluate different data samples. Furthermore, we demonstrate that the contrast between a representative sample in the sense of coverage of the input space and a representative sample mimicking the distribution of the target population is of particular relevance when building AI systems. Through empirical demonstrations on US Census data, we evaluate the opposing inherent qualities of these concepts. Finally, we propose a framework of questions for creating and documenting data with data representativity in mind, as an addition to existing dataset documentation templates.
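The contrast between coverage of the input space and mimicking the target distribution can be illustrated with a toy example. The sketch below is only an illustration under stated assumptions: the two measures (share of population categories present in the sample, and total variation distance between category frequencies), the function names `coverage` and `total_variation`, and the made-up population are all hypothetical choices, not the three measurable concepts the paper introduces.

```python
from collections import Counter


def coverage(sample, population):
    """Fraction of distinct population categories that appear in the sample."""
    pop_categories = set(population)
    return len(pop_categories & set(sample)) / len(pop_categories)


def total_variation(sample, population):
    """Total variation distance between category frequencies.

    0 means identical frequencies; 1 means no shared categories.
    """
    sample_counts = Counter(sample)
    pop_counts = Counter(population)
    categories = set(sample_counts) | set(pop_counts)
    return 0.5 * sum(
        abs(sample_counts[c] / len(sample) - pop_counts[c] / len(population))
        for c in categories
    )


# Made-up population (assumption): 90% group "A", 9% "B", 1% "C".
population = ["A"] * 90 + ["B"] * 9 + ["C"] * 1

# Sample that roughly mimics the population proportions but misses group "C".
proportional = ["A"] * 9 + ["B"] * 1

# Sample that covers every group but distorts the proportions.
balanced = ["A", "B", "C"] * 3

for name, sample in [("proportional", proportional), ("balanced", balanced)]:
    print(
        name,
        "coverage:", round(coverage(sample, population), 3),
        "TV distance:", round(total_variation(sample, population), 3),
    )
```

In this toy setting the proportional sample stays close to the population distribution yet leaves one group unobserved, while the balanced sample observes every group at the cost of a large distributional gap, which is one way to see why the two notions can pull in opposite directions.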


