Obstructing Classification via Projection

Machine learning and data mining techniques are effective tools for classifying large amounts of data, but they tend to preserve any bias inherent in the data, for example with regard to gender or race. Removing such bias from the data or from the learned representations is quite challenging. In this paper we study a geometric problem that models a possible approach to bias removal. Our input is a set of points P in Euclidean space R^d, where each point is labeled with k binary-valued properties. A priori we assume that it is "easy" to classify the data according to each property. Our goal is to obstruct classification according to one property by a suitable projection to a lower-dimensional Euclidean space R^m (m < d), while classification according to all other properties remains easy. What it means for classification to be easy depends on the classification model used. We first consider classification by linear separability, as employed by support vector machines. We use Kirchberger's Theorem to show that, under certain conditions, a simple projection to R^(d-1) suffices to eliminate the linear separability of one of the properties whilst maintaining the linear separability of the other properties. We also study the problem of maximizing the linear "inseparability" of the chosen property. Second, we consider more complex forms of separability and prove a connection between the number of projections required to obstruct classification and the Helly-type properties of such separabilities.
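
The linear-separability setting can be illustrated with a small sketch. It is not the construction from the paper (which relies on Kirchberger's Theorem); it is a toy example under stated assumptions: four points in R^2 carrying two binary properties, an orthogonal projection along a hand-picked direction, and a plain perceptron used as a heuristic separability check. The names `project_out` and `find_separator`, the choice of direction, and the epoch cap are all hypothetical. A `None` result from the perceptron only means that no separator was found within the cap, although in this example it is conclusive because two identical projected points carry different labels.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def project_out(points: Sequence[Point], w: Sequence[float]) -> List[Point]:
    """Orthogonally project each point onto the hyperplane through the
    origin with normal w, i.e. remove the component along w."""
    norm_sq = dot(w, w)
    return [
        tuple(p_i - dot(p, w) / norm_sq * w_i for p_i, w_i in zip(p, w))
        for p in points
    ]


def find_separator(
    points: Sequence[Point],
    labels: Sequence[bool],
    max_epochs: int = 1000,  # assumed cap; the perceptron never stops on inseparable data
) -> Optional[Tuple[List[float], float]]:
    """Perceptron search for (weights, bias) that strictly separates the
    True-labeled points from the False-labeled ones. Returns None if no
    separator is found within max_epochs."""
    weights = [0.0] * len(points[0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for p, label in zip(points, labels):
            sign = 1.0 if label else -1.0
            if sign * (dot(weights, p) + bias) <= 0:
                # Misclassified (or on the boundary): nudge the hyperplane.
                weights = [w_i + sign * p_i for w_i, p_i in zip(weights, p)]
                bias += sign
                mistakes += 1
        if mistakes == 0:
            return weights, bias
    return None


# Toy data: property A depends on the x-coordinate, property B on y.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
prop_a = [False, True, False, True]
prop_b = [False, False, True, True]

# Project along the x-axis, the direction that carries property A.
projected = project_out(points, (1.0, 0.0))

print("A separable before:", find_separator(points, prop_a) is not None)
print("B separable before:", find_separator(points, prop_b) is not None)
print("A separable after: ", find_separator(projected, prop_a) is not None)
print("B separable after: ", find_separator(projected, prop_b) is not None)
```

In this toy case both properties are separable before the projection, and only property B remains separable afterwards. For an exact separability test in higher dimensions, a linear-programming feasibility check would be a more reliable choice than the capped perceptron used here.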
