Open-vocabulary object detection (OVD) seeks to recognize and localize object categories beyond those seen during training. Recent approaches typically leverage vision-language models (VLMs) to generate pseudo-labels via image-text alignment, allowing detectors to generalize to unseen classes without explicit supervision. However, these methods depend heavily on direct image-text matching and neglect the intermediate reasoning steps essential for interpreting semantically complex scenes, which limits their robustness in crowded or occluded visual contexts. In this paper, we introduce CoT-PL, a new framework that integrates structured visual chain-of-thought (CoT) reasoning into the pseudo-labeling process. CoT-PL decomposes object understanding into three interpretable steps: (1) region perception even for unseen objects, (2) category recognition via zero-shot reasoning, and (3) background grounding to separate semantically complex objects. Crucially, the third step naturally motivates our contrastive background learning (CBL), which uses the pre-computed background cues as negatives to promote feature disentanglement between objects and background. In this way, CoT reasoning and CBL form an integrated pipeline tailored to robust pseudo-labeling in crowded or occluded scenes. Notably, in these two settings, our novel-class pseudo-label quality achieves relative improvements of 103.4% and 168.4% over the best prior method, respectively. Our extensive experiments demonstrate that CoT-PL achieves +7.7 AP50 on open-vocabulary COCO and +2.9 mask AP on LVIS for novel classes, setting a new state of the art. Code and models are available at https://github.com/hchoi256/cotpl.
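To make the CBL objective concrete, below is a minimal sketch of how a contrastive loss with pre-computed background negatives could be instantiated. It assumes an InfoNCE-style formulation in which each pseudo-labeled region embedding is pulled toward the text embedding of its matched category and pushed away from background embeddings; the function name `cbl_loss`, the temperature value, and the exact positive/negative construction are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def cbl_loss(region_feats, text_feats, bg_feats, tau=0.07):
    """InfoNCE-style contrastive background learning sketch.

    region_feats: (N, D) embeddings of pseudo-labeled object regions
    text_feats:   (N, D) text embeddings of matched category names (positives)
    bg_feats:     (M, D) pre-computed background embeddings (negatives)
    """
    region = F.normalize(region_feats, dim=-1)
    text = F.normalize(text_feats, dim=-1)
    bg = F.normalize(bg_feats, dim=-1)

    pos = (region * text).sum(-1, keepdim=True) / tau  # (N, 1) object-text similarity
    neg = region @ bg.t() / tau                        # (N, M) object-background similarity
    logits = torch.cat([pos, neg], dim=1)              # (N, 1+M)

    # The positive sits at index 0 of each row; cross-entropy against index 0
    # maximizes object-text alignment while repelling background negatives.
    targets = torch.zeros(region.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, targets)
```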