获取人类感官与文字描述符的相似性 (Words are all you need? Capturing human sensory similarity with textual descriptors)

Recent advances in multimodal training use textual descriptions to significantly enhance machine understanding of images and videos. Yet, it remains unclear to what extent language can fully capture sensory experiences across different modalities. A well-established approach for characterizing sensory experiences relies on similarity judgments, namely, the degree to which people perceive two distinct stimuli as similar. We explore the relation between human similarity judgments and language in a series of large-scale behavioral studies ($N=1,823$ participants) across three modalities (images, audio, and video) and two types of text descriptors: simple word tags and free-text captions. In doing so, we introduce a novel adaptive pipeline for tag mining that is both efficient and domain-general. We show that our prediction pipeline based on text descriptors exhibits excellent performance, and we compare it against a comprehensive array of 611 baseline models based on vision-, audio-, and video-processing architectures. We further show that the degree to which textual descriptors and models predict human similarity varies across and within modalities. Taken together, these studies illustrate the value of integrating machine learning and cognitive science approaches to better understand the similarities and differences between human and machine representations. We present an interactive visualization at https://words-are-all-you-need.s3.amazonaws.com/index.html for exploring the similarity between stimuli as experienced by humans and different methods reported in the paper.

翻译：在多式联运培训方面,最近的进展是使用文字描述,以大大增进对图像和视频的机体理解。然而,对于语言在多大程度上能够充分捕捉不同模式的感官经验,仍然不清楚。一种成熟的感官经验定性方法依赖于相似性判断,即人们认为两种截然不同的刺激因素具有相似性的程度。我们在一系列大规模的行为研究中探索人类相似性判断和语言之间的关系(N=1,823参与者),这三种模式(图像、音频和视频)和两种文本描述器:简单的字标记和自由文本说明。在这样做时,我们为标记采矿引入一个新的适应性管道,既高效又通用。我们表明我们基于文字描述器的预测管道表现优异,我们将其与基于视觉、音频和视频处理结构的611个基线模型的综合阵列(N=1823参与者)相比较。我们进一步表明,文字描述和模型预测人类相似性的程度在不同模式和不同模式内的差异。这些研究加在一起,显示了将机器学习和认知性科学方法相结合的价值,以便更好地了解目前/视觉-图像-系统之间的相似性对比。我们报告了目前-图像-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-系统-