Visuomotor policies built on generative architectures such as diffusion and flow matching show strong performance but degrade under distribution shift, recovering poorly without costly finetuning. In language modeling, test-time compute scaling has transformed the reasoning capabilities of modern LLMs by spending additional inference-time compute on refining candidate solutions. These methods typically use foundation models as zero-shot verification modules to synthesize improved candidates. In this work, we hypothesize that generative policies can similarly benefit from additional inference-time compute that employs zero-shot VLM-based verifiers; a systematic analysis of improving policy performance through such a generation-verification framework remains underexplored in the current literature. To this end, we introduce EVE, a modular generator-verifier interaction framework that boosts the performance of pretrained generative policies at test time with no additional training. EVE wraps a frozen base policy with multiple zero-shot, VLM-based verifier agents. Each verifier proposes refinements to the base policy's candidate actions, and an action incorporator fuses the aggregated verifier output into the base policy's prediction to produce the final executed action. We study design choices for generator-verifier information interfacing across a system of verifiers with distinct capabilities. Across a diverse suite of manipulation tasks, EVE consistently improves task success rates without any additional policy training. Through extensive ablations, we isolate the contributions of verifier capabilities and action-incorporator strategies, offering practical guidelines for building scalable, modular generator-verifier systems for embodied control.
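The generate-verify-incorporate loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the names `Verifier`, `eve_step`, and the mean-based incorporator are hypothetical, and the real system uses VLM-based verifiers rather than the toy callables shown here.

```python
# Hypothetical sketch of a generator-verifier loop in the spirit of EVE.
# All interfaces are illustrative assumptions, not the paper's actual API.
from dataclasses import dataclass
from typing import Callable, List
import numpy as np

Action = np.ndarray
Observation = np.ndarray


@dataclass
class Verifier:
    # A zero-shot verifier: maps (observation, candidate action) to a proposed refinement.
    # In EVE this role is played by a VLM agent; here it is a plain callable.
    propose: Callable[[Observation, Action], Action]


def eve_step(base_policy: Callable[[Observation], Action],
             verifiers: List[Verifier],
             incorporate: Callable[[Action, List[Action]], Action],
             obs: Observation) -> Action:
    """One test-time step: generate a candidate, collect verifier
    refinements, and fuse them into the final executed action.
    No policy parameters are updated."""
    candidate = base_policy(obs)                              # frozen generative policy
    refinements = [v.propose(obs, candidate) for v in verifiers]
    return incorporate(candidate, refinements)                # action incorporator


# Toy usage: two dummy verifiers and an averaging incorporator.
base = lambda obs: np.zeros(2)
verifiers = [
    Verifier(propose=lambda obs, a, d=d: a + d)
    for d in (np.array([1.0, 0.0]), np.array([0.0, 1.0]))
]
incorporate = lambda a, refs: a + np.mean(refs, axis=0) if refs else a
action = eve_step(base, verifiers, incorporate, obs=np.zeros(3))
```

The incorporator is deliberately kept separate from the verifiers, mirroring the paper's modular split between proposing refinements and fusing them into the executed action.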