Most machine learning methods are known to capture and exploit biases of the training data. While some biases are beneficial for learning, others are harmful. Specifically, image captioning models tend to exaggerate biases present in training data (e.g., if a word is present in 60% of training sentences, it might be predicted in 70% of sentences at test time). This can lead to incorrect captions in domains where unbiased captions are desired, or required, due to over-reliance on the learned prior and image context. In this work we investigate the generation of gender-specific caption words (e.g., man, woman) based on the person's appearance or the image context. We introduce a new Equalizer model that ensures equal gender probability when gender evidence is occluded in a scene and confident predictions when gender evidence is present. The resulting model is forced to look at a person rather than use contextual cues to make gender-specific predictions. The losses that comprise our model, the Appearance Confusion Loss and the Confident Loss, are general, and can be added to any description model in order to mitigate the impact of unwanted bias in a description dataset. Our proposed model has lower error than prior work when describing images with people and mentioning their gender, and more closely matches the ground truth ratio of sentences including women to sentences including men. We also show that, unlike other approaches, our model is indeed more often looking at people when predicting their gender.
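The two losses can be illustrated with a minimal, hedged sketch. Assuming PyTorch-style tensors of per-word probabilities at a time step where a gendered word is predicted, once for the image with the person region occluded and once for the full image, the Appearance Confusion Loss penalizes any gap between the probability mass on woman-words and man-words under occlusion, while the Confident Loss penalizes probability on the wrong gender relative to the correct one when evidence is visible. Function names, the exact weighting, and the vocabulary indexing below are illustrative, not the paper's exact formulation.

```python
import torch

def appearance_confusion_loss(probs_masked, woman_ids, man_ids):
    """Appearance Confusion Loss (sketch): when gender evidence is occluded,
    push the probability mass on woman-words and man-words to be equal by
    penalizing |p(woman-words) - p(man-words)| at the gendered time step."""
    p_w = probs_masked[:, woman_ids].sum(dim=-1)  # mass on e.g. "woman", "girl"
    p_m = probs_masked[:, man_ids].sum(dim=-1)    # mass on e.g. "man", "boy"
    return (p_w - p_m).abs().mean()

def confident_loss(probs_full, target_is_woman, woman_ids, man_ids, eps=1e-6):
    """Confident Loss (sketch): on the unoccluded image, penalize the ratio of
    the wrong-gender probability to the correct-gender probability, so the
    model is rewarded for confident, evidence-based gender words."""
    p_w = probs_full[:, woman_ids].sum(dim=-1)
    p_m = probs_full[:, man_ids].sum(dim=-1)
    wrong_over_right = torch.where(target_is_woman,
                                   p_m / (p_w + eps),
                                   p_w / (p_m + eps))
    return wrong_over_right.mean()
```

In practice these terms would be added, with weighting hyperparameters, to the standard cross-entropy captioning objective; the sketch only shows the per-time-step gender terms.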