Visual grounding, the task of linking textual queries to specific regions within images, plays a pivotal role in vision-language integration. Existing methods typically rely on extensive task-specific annotations and fine-tuning, limiting their ability to generalize to novel or out-of-distribution scenarios. To address these limitations, we introduce GroundingAgent, a novel agentic visual grounding framework that operates without any task-specific fine-tuning. GroundingAgent employs a structured, iterative reasoning mechanism that integrates pretrained open-vocabulary object detectors, multimodal large language models (MLLMs), and large language models (LLMs) to progressively refine candidate regions through joint semantic and spatial analysis. Remarkably, GroundingAgent achieves an average zero-shot grounding accuracy of 65.1% on the widely used RefCOCO, RefCOCO+, and RefCOCOg benchmarks, entirely without fine-tuning. Furthermore, by replacing the MLLM-generated captions with the original query texts, the accuracy at the selection stage alone reaches approximately 90%, closely matching supervised performance and underscoring the critical role of the LLM's reasoning capabilities. GroundingAgent also offers strong interpretability, transparently illustrating each reasoning step and providing clear insight into its decision-making process.
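To make the described pipeline concrete, the following is a minimal Python sketch of an agentic propose-describe-select loop of this kind. The interfaces `detect`, `describe`, and `reason` stand in for the open-vocabulary detector, the MLLM region captioner, and the LLM selector; these names, signatures, and the refinement logic are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of an agentic grounding loop; the callables `detect`,
# `describe`, and `reason` are hypothetical stand-ins for the pretrained
# detector, MLLM captioner, and LLM selector described in the abstract.
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in image coordinates


@dataclass
class Candidate:
    box: Box           # region proposed by the open-vocabulary detector
    caption: str = ""  # region description produced by the MLLM


def ground(
    image: object,
    query: str,
    detect: Callable[[object, str], List[Box]],            # detector: prompt -> boxes
    describe: Callable[[object, Box], str],                 # MLLM: region -> caption
    reason: Callable[[str, List[Candidate]], Tuple[int, str]],  # LLM: (index, refined prompt)
    max_rounds: int = 3,
) -> Optional[Box]:
    """Iteratively propose, describe, and select candidate regions for a query."""
    prompt = query
    last_candidates: List[Candidate] = []
    for _ in range(max_rounds):
        # 1. Propose candidate regions for the current (possibly refined) prompt.
        candidates = [Candidate(b) for b in detect(image, prompt)]
        if not candidates:
            continue
        last_candidates = candidates
        # 2. Describe each candidate so the LLM can compare semantics and spatial layout.
        for c in candidates:
            c.caption = describe(image, c.box)
        # 3. Ask the LLM to select the region matching the query; an index of -1
        #    means "no confident match", and the refined prompt starts a new round.
        choice, refined_prompt = reason(query, candidates)
        if 0 <= choice < len(candidates):
            return candidates[choice].box
        prompt = refined_prompt or prompt
    # Fall back to the first proposal from the last round if no selection was made.
    return last_candidates[0].box if last_candidates else None
```

The loop mirrors the cycle stated above: when the selector declines to commit, its refined prompt is fed back to the detector, which is one simple way the candidate set could be progressively narrowed across rounds.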