In recent years, automatic speech recognition (ASR) has witnessed transformative advancements driven by three complementary paradigms: data scaling, model size scaling, and deep integration with large language models (LLMs). However, LLMs are prone to hallucination, which can significantly degrade user experience in real-world ASR applications. In this paper, we present Fun-ASR, a large-scale, LLM-based ASR system that synergistically combines massive data, large model capacity, LLM integration, and reinforcement learning to achieve state-of-the-art performance across diverse and complex speech recognition scenarios. Moreover, Fun-ASR is specifically optimized for practical deployment, with enhancements in streaming capability, noise robustness, code-switching, hotword customization, and other real-world application requirements. Experimental results show that while most LLM-based ASR systems achieve strong performance on open-source benchmarks, they often underperform on real industry evaluation sets. Thanks to these production-oriented optimizations, Fun-ASR achieves state-of-the-art performance on real application datasets, demonstrating its effectiveness and robustness in practical settings. The code and models are available at https://github.com/FunAudioLLM/Fun-ASR.