Socially compliant navigation requires structured reasoning over dynamic pedestrians and static physical constraints to ensure safe and interpretable decisions. However, existing social navigation datasets often lack explicit reasoning supervision and exhibit highly long-tailed action distributions, limiting models' ability to learn safety-critical behaviors. To address these issues, we introduce MUSON, a multimodal dataset for short-horizon social navigation collected across diverse indoor and outdoor campus scenes. MUSON adopts a structured five-step Chain-of-Thought annotation scheme consisting of perception, prediction, reasoning, action, and explanation, with explicit modeling of static physical constraints and a deliberately balanced discrete action space. Compared to SNEI, MUSON provides mutually consistent reasoning, action, and explanation annotations. Benchmarking multiple state-of-the-art Small Vision-Language Models on MUSON shows that Qwen2.5-VL-3B achieves the highest decision accuracy of 0.8625, demonstrating that MUSON serves as an effective and reusable benchmark for socially compliant navigation. The dataset is publicly available at https://huggingface.co/datasets/MARSLab/MUSON.
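To make the annotation structure concrete, the following is a minimal sketch of loading MUSON from the Hugging Face Hub and inspecting one record. The dataset path comes from the abstract; the split name and the exact field names for the five Chain-of-Thought steps are assumptions based on the schema described above, not a confirmed interface.

```python
# Hypothetical sketch: load MUSON and print one annotated sample.
# Dataset path is from the paper; split and field names are assumed
# to mirror the five-step CoT schema (perception, prediction,
# reasoning, action, explanation).
from datasets import load_dataset

ds = load_dataset("MARSLab/MUSON", split="train")  # split name assumed

sample = ds[0]
# Each record is expected to pair an observation with the five CoT steps.
for step in ("perception", "prediction", "reasoning", "action", "explanation"):
    print(f"{step}: {sample.get(step)}")
```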