Zero-shot image classification has made promising progress by training aligned image and text encoders. The goal of this work is to advance zero-shot object detection, which aims to detect novel objects without bounding box or mask annotations. We propose ViLD, a training method via Vision and Language knowledge Distillation. We distill the knowledge from a pre-trained zero-shot image classification model (e.g., CLIP) into a two-stage detector (e.g., Mask R-CNN). Our method aligns the region embeddings in the detector to the text and image embeddings inferred by the pre-trained model. We use the text embeddings, obtained by feeding category names into the pre-trained text encoder, as the detection classifier. We then minimize the distance between the region embeddings and the image embeddings obtained by feeding region proposals into the pre-trained image encoder. During inference, we include the text embeddings of novel categories in the detection classifier for zero-shot detection. We benchmark the performance on the LVIS dataset by holding out all rare categories as novel categories. ViLD obtains 16.1 mask AP$_r$ with a Mask R-CNN (ResNet-50 FPN) for zero-shot detection, outperforming the supervised counterpart by 3.8. The model can directly transfer to other datasets, achieving 72.2 AP$_{50}$, 36.6 AP and 11.8 AP on PASCAL VOC, COCO and Objects365, respectively.
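To make the two training objectives concrete, the following is a minimal PyTorch sketch of the text-alignment and image-distillation losses described above. Random tensors stand in for the detector's region embeddings and the frozen CLIP encoders' outputs; the shapes `R`, `C`, `D`, the temperature value, and all variable names are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

# Assumed shapes: R region proposals, C base categories, D embedding dim.
R, C, D = 4, 8, 512

# Region embeddings from the detector head (random stand-ins here).
region_emb = F.normalize(torch.randn(R, D), dim=-1)

# Text embeddings of base-category names from the frozen text encoder,
# e.g. CLIP's encode_text(tokenize(name)) -- random stand-ins here.
text_emb = F.normalize(torch.randn(C, D), dim=-1)

# A learned background embedding, since "background" has no category name.
bg_emb = F.normalize(torch.randn(1, D, requires_grad=True), dim=-1)

# Text-alignment loss: cosine similarities scaled by a temperature serve as
# classification logits, trained with standard cross-entropy.
tau = 0.01  # temperature (assumed value)
logits = region_emb @ torch.cat([bg_emb, text_emb]).t() / tau
labels = torch.randint(0, C + 1, (R,))  # 0 = background
loss_text = F.cross_entropy(logits, labels)

# Image-distillation loss: pull region embeddings toward the image embeddings
# of the cropped proposals (stand-ins for encode_image outputs) with L1.
clip_image_emb = F.normalize(torch.randn(R, D), dim=-1)
loss_image = F.l1_loss(region_emb, clip_image_emb)

loss = loss_text + loss_image

# At inference, text embeddings of novel categories are simply concatenated
# to text_emb, extending the classifier without retraining the detector.
```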