We open-source MiMo-VL-Miloco-7B and its quantized variant MiMo-VL-Miloco-7B-GGUF, a pair of home-centric vision-language models that achieve strong performance on both home-scenario understanding and general multimodal reasoning. Built on the MiMo-VL-7B backbone, MiMo-VL-Miloco-7B is specialized for smart-home environments, attaining leading F1 scores on gesture recognition and common home-scenario understanding while also delivering consistent gains on video benchmarks such as Video-MME, Video-MMMU, and Charades-STA, as well as on general reasoning benchmarks including MMMU-Pro and MMLU-Pro. In our experiments, MiMo-VL-Miloco-7B outperforms strong closed-source and open-source baselines on home-scenario understanding and on several multimodal reasoning benchmarks. To balance specialization and generality, we design a two-stage training pipeline that combines supervised fine-tuning with reinforcement learning based on Group Relative Policy Optimization (GRPO), leveraging a compact mix of multi-domain training data. We further incorporate chain-of-thought supervision and token-budget-aware reasoning, enabling the model to acquire domain knowledge data-efficiently while keeping its inference-time reasoning concise. Our analysis shows that targeted home-scenario training not only enhances activity and gesture understanding but also improves text-only reasoning, with only modest trade-offs on document-centric tasks. Model checkpoints, quantized GGUF weights, and our home-scenario evaluation toolkit are publicly available at https://github.com/XiaoMi/xiaomi-mimo-vl-miloco to support research and deployment in real-world smart-home applications.
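For readers unfamiliar with GRPO, the following is a minimal sketch of the group-relative objective as commonly formulated (e.g., in DeepSeekMath); the exact reward design, KL handling, and token-budget terms used for MiMo-VL-Miloco-7B may differ. For each prompt $q$, a group of $G$ responses $\{o_1, \dots, o_G\}$ is sampled from the old policy, scored with rewards $r_1, \dots, r_G$, and each response's advantage is normalized within the group:

$$
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)},
$$

$$
\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\min\!\Big(\rho_{i,t}\,\hat{A}_i,\ \operatorname{clip}\big(\rho_{i,t},\,1-\varepsilon,\,1+\varepsilon\big)\,\hat{A}_i\Big)\right] - \beta\,\mathbb{D}_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big),
$$

where $\rho_{i,t} = \pi_\theta(o_{i,t} \mid q, o_{i,<t}) / \pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q, o_{i,<t})$ is the token-level importance ratio, $\varepsilon$ the clipping range, and $\beta$ the weight of the KL penalty against a reference policy $\pi_{\mathrm{ref}}$. Because advantages come from within-group normalization rather than a learned value function, no critic network is required.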
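As a quick-start for the quantized release, below is a minimal local-inference sketch using llama-cpp-python; the GGUF filename, sampling settings, and text-only prompt are illustrative assumptions (see the repository above for the actual released file names, chat template, and multimodal usage):

```python
# Minimal sketch: run the quantized model locally with llama-cpp-python
# (pip install llama-cpp-python). The filename below is hypothetical; check
# the MiMo-VL-Miloco-7B-GGUF release for the real quantized weights.
from llama_cpp import Llama

llm = Llama(
    model_path="MiMo-VL-Miloco-7B-Q4_K_M.gguf",  # hypothetical file name
    n_ctx=8192,       # roomy context window for chain-of-thought outputs
    n_gpu_layers=-1,  # offload all layers to GPU when available
)

# Text-only smoke test; real home-scenario inputs (video frames, images)
# would go through llama.cpp's multimodal pathway instead.
out = llm(
    "List the typical steps of a person making coffee in a kitchen.",
    max_tokens=256,
    temperature=0.2,
)
print(out["choices"][0]["text"])
```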