We introduce LongCat-Flash-Omni, a state-of-the-art open-source omni-modal model with 560 billion parameters that excels at real-time audio-visual interaction. By adopting a curriculum-inspired progressive training strategy that moves from simpler to increasingly complex modality sequence modeling tasks, LongCat-Flash-Omni attains comprehensive multimodal capabilities while preserving strong unimodal capabilities. Built upon LongCat-Flash, which adopts a high-performance Shortcut-connected Mixture-of-Experts (MoE) architecture with zero-computation experts, LongCat-Flash-Omni integrates efficient multimodal perception and speech reconstruction modules. Despite its size of 560B total parameters (with 27B activated), the model achieves low-latency real-time audio-visual interaction. For training infrastructure, we developed a modality-decoupled parallelism scheme designed to manage the data and model heterogeneity inherent in large-scale multimodal training; it sustains over 90% of the throughput of text-only training. Extensive evaluations show that LongCat-Flash-Omni achieves state-of-the-art performance on omni-modal benchmarks among open-source models, and delivers highly competitive results across a wide range of modality-specific tasks, including text, image, and video understanding, as well as audio understanding and generation. We provide a comprehensive overview of the model architecture, training procedures, and data strategies, and open-source the model to foster future research and development in the community.
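The Shortcut-connected MoE architecture itself is detailed in the LongCat-Flash report rather than in this abstract. Purely as a rough illustration of the zero-computation-expert idea referenced above, the following minimal PyTorch sketch routes each token either to a small FFN expert or to an identity slot that performs no computation; all class names, dimensions, and the top-1 routing rule are illustrative assumptions, not the model's actual implementation.

```python
# Hypothetical sketch (NOT the authors' code): an MoE layer whose router can send
# a token to a real FFN expert or to a "zero-computation" identity slot that
# simply passes the token through, spending no FFN compute on it.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FFNExpert(nn.Module):
    """A standard feed-forward expert."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.gelu(self.up(x)))


class MoEWithZeroComputationExperts(nn.Module):
    """Top-1 router over `n_ffn` FFN experts plus `n_zero` identity experts (illustrative)."""
    def __init__(self, d_model: int = 64, d_ff: int = 256, n_ffn: int = 4, n_zero: int = 2):
        super().__init__()
        self.n_ffn = n_ffn
        self.router = nn.Linear(d_model, n_ffn + n_zero)
        self.experts = nn.ModuleList(FFNExpert(d_model, d_ff) for _ in range(n_ffn))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model); each token picks exactly one expert slot.
        probs = self.router(x).softmax(dim=-1)
        gate, idx = probs.max(dim=-1)
        # Default path: zero-computation experts act as the identity (no FFN cost).
        out = x.clone()
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                out[mask] = x[mask] + gate[mask].unsqueeze(-1) * expert(x[mask])
        return out


# Usage: tokens routed to slot indices >= n_ffn skip the FFN entirely.
tokens = torch.randn(8, 64)
print(MoEWithZeroComputationExperts()(tokens).shape)  # torch.Size([8, 64])
```

In this toy setup the fraction of tokens landing on identity slots directly reduces per-token compute, which is the intuition behind zero-computation experts; how LongCat-Flash balances and trains such routing is described in its own report.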