Visual traversability estimation is critical for autonomous navigation, but existing VLM-based methods rely on hand-crafted prompts, generalize poorly across embodiments, and output only traversability maps, leaving trajectory generation to slow external planners. We propose SwarmDiffusion, a lightweight end-to-end diffusion model that jointly predicts traversability and generates a feasible trajectory from a single RGB image. To remove the need for annotated or planner-produced paths, we introduce a planner-free trajectory construction pipeline based on randomized waypoint sampling, Bézier smoothing, and regularization terms enforcing connectivity, safety, directionality, and path thinness. This enables learning stable motion priors without demonstrations. SwarmDiffusion leverages VLM-derived supervision without prompt engineering and conditions the diffusion process on a compact embodiment state, producing physically consistent, traversable paths that transfer across different robot platforms. Across indoor environments and two embodiments (quadruped and aerial), the method achieves 80-100% navigation success with 0.09 s inference time, and adapts to a new robot using only 500 additional visual samples. It generalizes reliably to unseen environments in simulation and real-world trials, offering a scalable, prompt-free approach to unified traversability reasoning and trajectory generation.
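To make the planner-free trajectory construction concrete, the following is a minimal sketch of the randomized-waypoint-plus-Bézier-smoothing idea described above. The function names, the Gaussian jitter parameter, and the pixel-coordinate framing are illustrative assumptions, not the paper's implementation, and the regularization terms (connectivity, safety, directionality, path thinness) are omitted here.

```python
import numpy as np
from math import comb

def bezier_curve(control_points: np.ndarray, n_samples: int = 100) -> np.ndarray:
    """Evaluate a Bezier curve defined by control points using the Bernstein basis."""
    n = len(control_points) - 1
    t = np.linspace(0.0, 1.0, n_samples)
    # Bernstein polynomials B_{i,n}(t), one column per control point
    basis = np.stack(
        [comb(n, i) * t**i * (1 - t) ** (n - i) for i in range(n + 1)], axis=1
    )                                   # shape (n_samples, n+1)
    return basis @ control_points       # shape (n_samples, 2)

def sample_planner_free_trajectory(start, goal, n_waypoints=3, jitter=20.0, rng=None):
    """Sample intermediate waypoints between start and goal with Gaussian jitter,
    then smooth the resulting control polygon with a Bezier curve."""
    rng = np.random.default_rng(rng)
    start, goal = np.asarray(start, float), np.asarray(goal, float)
    alphas = np.linspace(0.0, 1.0, n_waypoints + 2)[1:-1, None]
    mids = start + alphas * (goal - start) + rng.normal(0.0, jitter, size=(n_waypoints, 2))
    control = np.vstack([start, mids, goal])
    return bezier_curve(control)

# Hypothetical usage: one candidate path in a 640x480 image, from the bottom-center
# of the frame toward a goal pixel inside the traversable region.
traj = sample_planner_free_trajectory(start=(320, 470), goal=(400, 220), rng=0)
print(traj.shape)  # (100, 2) smoothed (x, y) pixel coordinates
```

In this reading, many such randomized candidates would be filtered or reweighted by the traversability map and the regularization terms before serving as supervision for the diffusion model.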