Visual grounding, the task of linking textual queries to specific regions within images, plays a pivotal role in vision-language integration. Existing methods typically rely on extensive task-specific annotations and fine-tuning, limiting their ability to generalize to novel or out-of-distribution scenarios. To address these limitations, we introduce GroundingAgent, a novel agentic visual grounding framework that operates without any task-specific fine-tuning. GroundingAgent employs a structured, iterative reasoning mechanism that integrates pretrained open-vocabulary object detectors, multimodal large language models (MLLMs), and large language models (LLMs) to progressively refine candidate regions through joint semantic and spatial analysis. Remarkably, GroundingAgent achieves an average zero-shot grounding accuracy of 65.1% on the widely used RefCOCO, RefCOCO+, and RefCOCOg benchmarks, entirely without fine-tuning. Furthermore, by replacing the MLLM-generated captions with the original query texts, the accuracy at the selection stage alone reaches approximately 90%, closely matching supervised performance and underscoring the critical role of the LLM's reasoning capabilities. GroundingAgent also offers strong interpretability, transparently illustrating each reasoning step and providing clear insight into its decision-making process.
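To make the described pipeline concrete, the following is a minimal Python sketch of an agentic propose-describe-select loop of this kind. The interfaces `detect`, `describe`, and `reason` stand in for the open-vocabulary detector, the MLLM region captioner, and the LLM selector; these names, signatures, and the refinement logic are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of an agentic grounding loop; the callables `detect`,
# `describe`, and `reason` are hypothetical stand-ins for the pretrained
# detector, MLLM captioner, and LLM selector described in the abstract.
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in image coordinates


@dataclass
class Candidate:
    box: Box           # region proposed by the open-vocabulary detector
    caption: str = ""  # region description produced by the MLLM


def ground(
    image: object,
    query: str,
    detect: Callable[[object, str], List[Box]],            # detector: prompt -> boxes
    describe: Callable[[object, Box], str],                 # MLLM: region -> caption
    reason: Callable[[str, List[Candidate]], Tuple[int, str]],  # LLM: (index, refined prompt)
    max_rounds: int = 3,
) -> Optional[Box]:
    """Iteratively propose, describe, and select candidate regions for a query."""
    prompt = query
    last_candidates: List[Candidate] = []
    for _ in range(max_rounds):
        # 1. Propose candidate regions for the current (possibly refined) prompt.
        candidates = [Candidate(b) for b in detect(image, prompt)]
        if not candidates:
            continue
        last_candidates = candidates
        # 2. Describe each candidate so the LLM can compare semantics and spatial layout.
        for c in candidates:
            c.caption = describe(image, c.box)
        # 3. Ask the LLM to select the region matching the query; an index of -1
        #    means "no confident match", and the refined prompt starts a new round.
        choice, refined_prompt = reason(query, candidates)
        if 0 <= choice < len(candidates):
            return candidates[choice].box
        prompt = refined_prompt or prompt
    # Fall back to the first proposal from the last round if no selection was made.
    return last_candidates[0].box if last_candidates else None
```

The loop mirrors the cycle stated above: when the selector declines to commit, its refined prompt is fed back to the detector, which is one simple way the candidate set could be progressively narrowed across rounds.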