Vision-Language-Action (VLA) models inherit strong priors from pretrained Vision-Language Models (VLMs), but naive fine-tuning often disrupts these representations and harms generalization. Existing fixes -- freezing modules or applying uniform regularization -- either overconstrain adaptation or ignore the differing roles of VLA components. We present MAPS (Module-Wise Proximity Scheduling), the first robust fine-tuning framework for VLAs. Through systematic analysis, we uncover an empirical order in which proximity constraints should be relaxed to balance stability and flexibility. MAPS linearly schedules this relaxation, enabling visual encoders to stay close to their pretrained priors while action-oriented language layers adapt more freely. MAPS introduces no additional parameters or data and integrates seamlessly into existing VLAs. Across MiniVLA-VQ, MiniVLA-OFT, and OpenVLA-OFT, evaluated on challenging benchmarks such as SimplerEnv, CALVIN, and LIBERO as well as in real-world experiments on the Franka Emika Panda platform, MAPS consistently boosts both in-distribution and out-of-distribution performance (by up to +30%). Our findings highlight empirically guided proximity to pretrained VLMs as a simple yet powerful principle for preserving broad generalization in VLM-to-VLA transfer.
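To make the mechanism concrete, below is a minimal PyTorch sketch of what module-wise proximity scheduling could look like: an L2-SP-style penalty pulls each parameter group toward its frozen pretrained reference, and each group's weight is linearly relaxed toward zero in a fixed order (here, language layers before the visual encoder, consistent with the abstract's description). The grouping heuristic `module_group`, the initial weights `lam_init`, and the phase-based linear schedule are illustrative assumptions, not MAPS's exact configuration.

```python
# Sketch of module-wise proximity scheduling for VLA fine-tuning.
# All names here (module_group prefixes, lam_init, relax_order) are
# hypothetical; the paper's exact grouping, weights, and schedule
# are not reproduced from the abstract alone.
import torch

def module_group(param_name: str) -> str:
    """Map a parameter name to a coarse module group (assumed prefixes)."""
    if param_name.startswith("vision_encoder"):
        return "vision"
    if param_name.startswith("language_model"):
        return "language"
    return "other"

def scheduled_lambdas(step, total_steps, lam_init, relax_order):
    """Linearly relax each group's proximity weight toward zero, one group
    after another, so groups late in relax_order (e.g. the visual encoder)
    stay close to their pretrained priors for longest."""
    lambdas = dict(lam_init)
    phase_len = total_steps / len(relax_order)
    for i, group in enumerate(relax_order):
        start = i * phase_len
        frac = min(max((step - start) / phase_len, 0.0), 1.0)
        lambdas[group] = lam_init[group] * (1.0 - frac)  # linear decay
    return lambdas

def proximity_penalty(model, ref_params, lambdas):
    """L2 distance to the frozen pretrained reference, weighted per group."""
    penalty = torch.zeros((), device=next(model.parameters()).device)
    for name, p in model.named_parameters():
        lam = lambdas.get(module_group(name), 0.0)
        if lam > 0.0:
            penalty = penalty + lam * (p - ref_params[name]).pow(2).sum()
    return penalty

# Usage inside a training loop (sketch):
#   ref_params = {n: p.detach().clone() for n, p in model.named_parameters()}
#   for step in range(total_steps):
#       lambdas = scheduled_lambdas(step, total_steps,
#                                   lam_init={"vision": 1e-2, "language": 1e-3},
#                                   relax_order=["language", "vision"])
#       loss = task_loss(batch) + proximity_penalty(model, ref_params, lambdas)
```

Under this reading, the "empirical order" from the paper's analysis would simply determine `relax_order`, and the scheduler adds no parameters or extra data, matching the abstract's claim.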