Zero-shot learning aims to recognize unseen objects using their semantic representations. Most existing works use visual attributes labeled by humans, which are not suitable for large-scale applications. In this paper, we revisit the use of documents as semantic representations. We argue that documents like Wikipedia pages contain rich visual information, which, however, can easily be buried by the vast amount of non-visual sentences. To address this issue, we propose a semi-automatic mechanism for visual sentence extraction that leverages the document section headers and the clustering structure of visual sentences. The extracted visual sentences, after a novel weighting scheme to distinguish similar classes, essentially form semantic representations like visual attributes but require much less human effort. On the ImageNet dataset with over 10,000 unseen classes, our representations lead to a 64% relative improvement over the commonly used ones.