Clustering is a fundamental task in machine learning and data analysis, but it frequently fails to provide fair representation for various marginalized communities defined by multiple protected attributes -- a shortcoming often caused by biases in the training data. As a result, there is a growing need to enhance the fairness of clustering outcomes, ideally by making minimal modifications, possibly as a post-processing step after conventional clustering. Recently, Chakraborty et al. [COLT'25] initiated the study of \emph{closest fair clustering}, though in a restricted scenario where data points belong to only two groups. In practice, however, data points are typically characterized by many groups, reflecting diverse protected attributes such as age, ethnicity, gender, etc. In this work, we generalize the study of the \emph{closest fair clustering} problem to settings with an arbitrary number (more than two) of groups. We begin by showing that the problem is NP-hard even when all groups are of equal size -- a stark contrast with the two-group case, for which an exact algorithm exists. Next, we propose near-linear time approximation algorithms that efficiently handle arbitrary-sized multiple groups, thereby answering an open question posed by Chakraborty et al. [COLT'25]. Leveraging our closest fair clustering algorithms, we further achieve improved approximation guarantees for the \emph{fair correlation clustering} problem, advancing the state-of-the-art results established by Ahmadian et al. [AISTATS'20] and Ahmadi et al. [2020]. Additionally, we are the first to provide approximation algorithms for the \emph{fair consensus clustering} problem involving multiple (more than two) groups, thus addressing another open direction highlighted by Chakraborty et al. [COLT'25].
翻译:聚类是机器学习和数据分析中的一项基础任务,但它常常无法为基于多个受保护属性定义的各类边缘化群体提供公平表征——这一缺陷通常由训练数据中的偏差所导致。因此,亟需提升聚类结果的公平性,理想情况下应通过最小化修改(可能作为传统聚类后的后处理步骤)来实现。近期,Chakraborty等人[COLT'25]首次研究了\\emph{最接近公平聚类}问题,但其研究局限于数据点仅属于两个组别的场景。然而在实际应用中,数据点通常具有多个组别特征,反映了年龄、种族、性别等多样化的受保护属性。本文中,我们将\\emph{最接近公平聚类}问题的研究推广至任意数量(超过两个)组别的设定。我们首先证明,即使所有组别规模相等,该问题仍是NP难的——这与存在精确算法的双组情况形成鲜明对比。接着,我们提出了近线性时间的近似算法,可高效处理任意规模的多组情况,从而回答了Chakraborty等人[COLT'25]提出的一个开放性问题。基于我们的最接近公平聚类算法,我们进一步为\\emph{公平相关聚类}问题实现了改进的近似保证,推进了Ahmadian等人[AISTATS'20]与Ahmadi等人[2020]所建立的前沿成果。此外,我们首次为涉及多个(超过两个)组别的\\emph{公平共识聚类}问题提供了近似算法,从而解决了Chakraborty等人[COLT'25]强调的另一个开放研究方向。