Recently, the use of synthetic data generated by GANs has become a popular method of data augmentation for many applications. While practitioners celebrate this as an economical way to obtain synthetic data for training data-hungry machine learning models, it is not clear that they recognize the perils of such an augmentation technique when applied to an already-biased dataset. Although one expects GANs to replicate the distribution of the original data, in real-world settings with limited data and finite network capacity, GANs suffer from mode collapse, especially when the data comes from online social media platforms or the web, which are rarely balanced. In this paper, we show that in settings where the data exhibits bias along some axes (e.g., gender, race), the failure modes of Generative Adversarial Networks (GANs) exacerbate the biases in the generated data. More often than not, this bias is unavoidable; we empirically demonstrate that, given as input a dataset of headshots of engineering faculty collected from 47 online university directory webpages in the United States that is biased toward white males, a state-of-the-art (unconditional variant of) GAN "imagines" faces of synthetic engineering professors that have masculine facial features and white skin color (inferred using human studies and a state-of-the-art gender recognition system). We also conduct a preliminary case study to highlight how Snapchat's explosively popular "female" filter (widely believed to use a conditional variant of GAN) consistently lightens the skin tones of women of color when trying to make face images appear more feminine. Our study is meant to serve as a cautionary tale for lay practitioners who may unknowingly increase the bias in their training data by using GAN-based augmentation techniques with web data, and to showcase the dangers of using biased datasets for facial applications.
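The mechanism described above, where mode collapse causes GAN-based augmentation to further skew an already-imbalanced dataset, can be illustrated with a toy simulation. The sketch below is hypothetical and not taken from the paper: it models a "collapsed" generator that drops half of the minority mode's probability mass, then measures the minority share before and after naive augmentation.

```python
import random

random.seed(0)

def sample_real(n, p_majority=0.8):
    # A biased "real" dataset: the majority group makes up 80% of samples.
    return ["majority" if random.random() < p_majority else "minority"
            for _ in range(n)]

def sample_gan(n, p_majority=0.8, mode_drop=0.5):
    # A mode-collapsed generator (hypothetical): a fraction `mode_drop`
    # of the minority mode's probability mass is lost, so the generator
    # over-produces the majority group relative to the real distribution.
    p_minority = (1 - p_majority) * (1 - mode_drop)
    return ["minority" if random.random() < p_minority else "majority"
            for _ in range(n)]

real = sample_real(10_000)
augmented = real + sample_gan(10_000)  # naive GAN-based augmentation

frac_real = real.count("minority") / len(real)
frac_aug = augmented.count("minority") / len(augmented)
print(f"minority share before augmentation: {frac_real:.3f}")
print(f"minority share after  augmentation: {frac_aug:.3f}")
```

Under these assumed numbers, the minority share drops from roughly 20% to roughly 15% after a single round of augmentation; repeating the augment-and-retrain loop would compound the skew further.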