This paper proposes VLA-AN, an efficient, onboard Vision-Language-Action (VLA) framework dedicated to autonomous drone navigation in complex environments. VLA-AN addresses four major limitations of existing large aerial navigation models: the data domain gap, insufficient temporal reasoning for sequential navigation, the safety risks of generative action policies, and onboard deployment constraints. First, we construct a high-fidelity dataset using 3D Gaussian Splatting (3D-GS) to effectively bridge the domain gap. Second, we introduce a progressive three-stage training framework that sequentially reinforces scene comprehension, core flight skills, and complex navigation capabilities. Third, we design a lightweight, real-time action module coupled with geometric safety correction, which ensures fast, collision-free, and stable command generation and mitigates the safety risks inherent in stochastic generative policies. Finally, through deep optimization of the onboard deployment pipeline, VLA-AN achieves a robust 8.3x improvement in real-time inference throughput on resource-constrained UAVs. Extensive experiments demonstrate that VLA-AN significantly improves spatial grounding, scene reasoning, and long-horizon navigation, achieving a maximum single-task success rate of 98.1% and offering an efficient, practical solution for full-chain closed-loop autonomy on lightweight aerial robots.