In recent years, automatic speech recognition (ASR) has witnessed transformative advancements driven by three complementary paradigms: data scaling, model size scaling, and deep integration with large language models (LLMs). However, LLMs are prone to hallucination, which can significantly degrade user experience in real-world ASR applications. In this paper, we present Fun-ASR, a large-scale, LLM-based ASR system that synergistically combines massive data, large model capacity, LLM integration, and reinforcement learning to achieve state-of-the-art performance across diverse and complex speech recognition scenarios. Moreover, Fun-ASR is specifically optimized for practical deployment, with enhancements in streaming capability, noise robustness, code-switching, hotword customization, and other real-world application requirements. Experimental results show that while most LLM-based ASR systems achieve strong performance on open-source benchmarks, they often underperform on real industry evaluation sets. Thanks to these production-oriented optimizations, Fun-ASR achieves state-of-the-art performance on real application datasets, demonstrating its effectiveness and robustness in practical settings. The code and models are available at https://github.com/FunAudioLLM/Fun-ASR.