We introduce LongCat-Flash-Omni, a state-of-the-art open-source omni-modal model with 560 billion parameters that excels at real-time audio-visual interaction. By adopting a curriculum-inspired progressive training strategy that transitions from simpler to increasingly complex modality-sequence modeling tasks, LongCat-Flash-Omni attains comprehensive multimodal capabilities while maintaining strong unimodal capabilities. Building upon LongCat-Flash, which adopts a high-performance Shortcut-connected Mixture-of-Experts (MoE) architecture with zero-computation experts, LongCat-Flash-Omni integrates efficient multimodal perception and speech-reconstruction modules. Despite its size of 560B total parameters (with 27B activated), LongCat-Flash-Omni achieves low-latency real-time audio-visual interaction. For training infrastructure, we developed a modality-decoupled parallelism scheme designed to manage the data and model heterogeneity inherent in large-scale multimodal training; it sustains over 90% of the throughput achieved by text-only training. Extensive evaluations show that LongCat-Flash-Omni achieves state-of-the-art performance on omni-modal benchmarks among open-source models, and delivers highly competitive results across a wide range of modality-specific tasks, including text, image, and video understanding, as well as audio understanding and generation. We provide a comprehensive overview of the model architecture design, training procedures, and data strategies, and open-source the model to foster future research and development in the community.
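As a rough illustration of the "zero-computation experts" idea mentioned above, the sketch below shows a generic MoE layer in which some routable "experts" are identity maps that cost no FLOPs, so the number of parameters actually activated varies per token. This is a minimal conceptual example under assumed names and dimensions (`MoEWithZeroComputeExperts`, `d_model=64`, top-2 routing), not the LongCat-Flash implementation.

```python
# Hypothetical sketch of an MoE layer with "zero-computation" (identity) experts.
# All names, sizes, and routing details are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEWithZeroComputeExperts(nn.Module):
    def __init__(self, d_model=64, n_real_experts=8, n_zero_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.n_real = n_real_experts
        # Zero-computation experts have no parameters; they are identity maps.
        self.n_total = n_real_experts + n_zero_experts
        self.router = nn.Linear(d_model, self.n_total)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_real_experts)
        ])

    def forward(self, x):  # x: (tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)
        weights, idx = torch.topk(scores, self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in range(self.n_total):
                mask = idx[:, slot] == e
                if not mask.any():
                    continue
                if e < self.n_real:
                    y = self.experts[e](x[mask])   # real expert: run its FFN
                else:
                    y = x[mask]                    # zero-computation expert: identity, no FLOPs
                out[mask] += weights[mask, slot].unsqueeze(-1) * y
        return out

tokens = torch.randn(16, 64)
print(MoEWithZeroComputeExperts()(tokens).shape)  # torch.Size([16, 64])
```

In a layer like this, tokens routed (partly) to identity experts skip part of the FFN computation, which is one way a very large total parameter count can coexist with a much smaller activated count per token.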