Automatic Speech Recognition (ASR) systems degrade significantly in noisy environments, a challenge that is especially severe for low-resource languages such as Persian. Even state-of-the-art models such as Whisper struggle to maintain accuracy across varying signal-to-noise ratios (SNRs). This study presents a robust noise-sensitive ASR error correction framework that combines multiple hypotheses with noise-aware modeling. Using noisy Persian speech, we generate 5-best hypotheses from a modified Whisper-large decoder. We introduce Error Level Noise (ELN), a representation that captures semantic- and token-level disagreement across hypotheses and thereby quantifies the linguistic distortions caused by noise. ELN thus provides a direct measure of noise-induced uncertainty, enabling a large language model (LLM) to reason about the reliability of each hypothesis during correction. Three models are evaluated: (1) a base LLaMA-2-7B model without fine-tuning, (2) a fine-tuned variant trained on text-only hypotheses, and (3) a noise-conditioned model integrating ELN embeddings at both sentence and word levels. Experimental results demonstrate that the ELN-conditioned model achieves substantial reductions in Word Error Rate (WER). Specifically, on the challenging Mixed Noise test set, the proposed Fine-tuned + ELN (Ours) model reduces the WER from the Raw Whisper baseline of 31.10\% to 24.84\%, substantially outperforming the Fine-tuned (No ELN) text-only baseline at 30.79\%. In contrast, the original LLaMA-2-7B model without fine-tuning increases the WER to 64.58\%, demonstrating that it cannot correct Persian errors on its own. These results confirm the effectiveness of combining multiple hypotheses with noise-aware embeddings for robust Persian ASR in noisy real-world scenarios.
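To make the reported quantities concrete, the following is a minimal sketch of how WER and a simple hypothesis-disagreement score could be computed. It is illustrative only: the function names (`wer`, `nbest_disagreement`, `relative_reduction`) are hypothetical, whitespace tokenization is an assumption, and the disagreement score is a rough token-level proxy, not the ELN representation defined in this work.

```python
from itertools import combinations
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref_tokens = reference.split()  # assumption: whitespace tokenization
    if not ref_tokens:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_tokens, hypothesis.split()) / len(ref_tokens)


def nbest_disagreement(hypotheses: List[str]) -> float:
    """Mean pairwise normalized word edit distance across N-best hypotheses.

    Hypothetical proxy for token-level disagreement; NOT the paper's ELN.
    """
    pairs = list(combinations(hypotheses, 2))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        tokens_a, tokens_b = a.split(), b.split()
        denom = max(len(tokens_a), len(tokens_b), 1)
        total += edit_distance(tokens_a, tokens_b) / denom
    return total / len(pairs)


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative error reduction of `improved` with respect to `baseline`."""
    return (baseline - improved) / baseline
```

Under this sketch, `relative_reduction(31.10, 24.84)` expresses the Mixed Noise improvement over Raw Whisper as a fraction of the baseline WER (roughly one fifth), and a higher `nbest_disagreement` value over the 5-best list would indicate that the hypotheses diverge more, which is the kind of signal a noise-aware corrector could condition on.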