Large language models produce human-like text that drives a growing number of applications. However, recent literature and, increasingly, real-world observations have demonstrated that these models can generate language that is toxic, biased, untruthful, or otherwise harmful. Though work to evaluate language model harms is under way, translating foresight about which harms may arise into rigorous benchmarks is not straightforward. To facilitate this translation, we outline six ways of characterizing harmful text which merit explicit consideration when designing new benchmarks. We then use these characteristics as a lens to identify trends and gaps in existing benchmarks. Finally, we apply them in a case study of the Perspective API, a toxicity classifier that is widely used in harm benchmarks. Our characteristics provide one piece of the bridge that translates between foresight and effective evaluation.