As learning machines increase their influence on decisions concerning human lives, analyzing their fairness properties becomes a subject of central importance. Yet, our best tools for measuring the fairness of learning systems are rigid fairness metrics encapsulated as mathematical one-liners; these offer limited power to the stakeholders involved in the prediction task, and are easy to manipulate when we exert excessive pressure to optimize them. To address these issues, we propose to shift focus from shaping fairness metrics to curating the distributions of examples under which these are computed. In particular, we posit that every claim about fairness should be immediately followed by the tagline "Fair under what examples, and collected by whom?". By highlighting connections to the literature on domain generalization, we propose to measure fairness as the ability of the system to generalize under multiple stress tests -- distributions of examples with social relevance. We encourage each stakeholder to curate one or multiple stress tests containing examples reflecting their (possibly conflicting) interests. The machine passes or fails each stress test by exceeding or falling short of a pre-defined metric value. The test results involve all stakeholders in a discussion about how to improve the learning system, and provide flexible assessments of fairness dependent on context and based on interpretable data. We provide full implementation guidelines for stress testing, illustrate both the benefits and shortcomings of this framework, and introduce a cryptographic scheme to enable a degree of prediction accountability from system providers.
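The pass/fail protocol above can be sketched concretely: each stakeholder supplies a curated set of examples, a metric, and a pre-defined threshold, and the system is evaluated on each slice independently. This is a minimal illustrative sketch, not the paper's implementation; all names (`StressTest`, `run_stress_tests`, the toy classifier and data) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class StressTest:
    """A stakeholder-curated stress test (names here are illustrative)."""
    name: str                                        # who curated it / what it probes
    examples: List[Tuple[list, int]]                 # (features, label) pairs with social relevance
    metric: Callable[[List[int], List[int]], float]  # e.g. accuracy on this slice
    threshold: float                                 # pre-defined value the system must meet

def accuracy(y_true: List[int], y_pred: List[int]) -> float:
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def run_stress_tests(predict: Callable, tests: List[StressTest]) -> dict:
    """Evaluate the predictor on every stress test; report (metric value, passed?)."""
    report = {}
    for test in tests:
        y_true = [label for _, label in test.examples]
        y_pred = [predict(x) for x, _ in test.examples]
        value = test.metric(y_true, y_pred)
        report[test.name] = (value, value >= test.threshold)
    return report

# Toy usage: a trivial classifier and two (hypothetical) stakeholder-curated tests.
predict = lambda x: int(x[0] > 0)
tests = [
    StressTest("applicants_group_a", [([1], 1), ([-1], 0), ([2], 1)], accuracy, 0.9),
    StressTest("applicants_group_b", [([1], 0), ([-1], 0)], accuracy, 0.9),
]
report = run_stress_tests(predict, tests)
# A failed test flags the slice for discussion among stakeholders rather than
# collapsing all fairness concerns into a single global number.
```

The design point is that each test is evaluated on its own curated distribution, so conflicting stakeholder interests surface as separate, interpretable verdicts instead of being averaged away.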