Reproducing and Dissecting Denoising Language Models for Speech Recognition

Denoising language models (DLMs) have been proposed as a powerful alternative to traditional language models (LMs) for automatic speech recognition (ASR), motivated by their ability to use bidirectional context and adapt to a specific ASR model's error patterns. However, the complexity of the DLM training pipeline has hindered wider investigation. This paper presents the first independent, large-scale empirical study of DLMs. We build and release a complete, reproducible pipeline to systematically investigate the impact of key design choices. We evaluate dozens of configurations across multiple axes, including various data augmentation techniques (e.g., SpecAugment, dropout, mixup), different text-to-speech systems, and multiple decoding strategies. Our comparative analysis in a common subword vocabulary setting demonstrates that DLMs outperform traditional LMs, but only after a distinct compute tipping point. While LMs are more efficient at lower budgets, DLMs scale better with longer training, mirroring behaviors observed in diffusion language models. However, we observe smaller improvements than those reported in prior character-based work, which indicates that the DLM's performance is conditional on factors such as the vocabulary. Our analysis reveals that a key factor for improving performance is to condition the DLM on richer information from the ASR's hypothesis space, rather than just a single best guess. To this end, we introduce DLM-sum, a novel method for decoding from multiple ASR hypotheses, which consistently outperforms the previously proposed DSR decoding method. We believe our findings and public pipeline provide a crucial foundation for the community to better understand, improve, and build upon this promising class of models. The code is publicly available at https://github.com/rwth-i6/2025-denoising-lm/.
