Speech enhancement methods are commonly believed to improve the performance of automatic speech recognition (ASR) in noisy environments. However, the effectiveness of these techniques cannot be taken for granted for modern large-scale ASR models trained on diverse, noisy data. We present a systematic evaluation of MetricGAN-plus-voicebank denoising on four state-of-the-art ASR systems (OpenAI Whisper, NVIDIA Parakeet, Google Gemini Flash 2.0, and Parrotlet-a), using 500 medical speech recordings under nine noise conditions. ASR performance is measured with semantic WER (semWER), a word error rate (WER) metric computed after domain-specific text normalization. Our results reveal a counterintuitive finding: speech enhancement preprocessing degrades ASR performance across all noise conditions and models. Original noisy audio achieves lower semWER than enhanced audio in all 40 tested configurations (4 models × 10 conditions), with degradations ranging from 1.1% to 46.6% absolute semWER increase. These findings suggest that modern ASR models possess sufficient internal noise robustness and that traditional speech enhancement may remove acoustic features critical for ASR. For practitioners deploying medical scribe systems in noisy clinical environments, our results indicate that preprocessing audio with noise reduction may be not only computationally wasteful but also potentially harmful to transcription accuracy.
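To make the metric concrete, the sketch below shows one plausible way to compute a semWER-style score: standard Levenshtein-based WER evaluated after applying domain-specific text normalization to both reference and hypothesis. The specific normalization rules (lowercasing, punctuation stripping, a toy medical-abbreviation map) are illustrative assumptions, not the paper's actual normalization pipeline.

```python
# Hedged sketch of a semWER-style metric: ordinary WER computed on
# domain-normalized text. The normalization rules here are hypothetical
# examples, not the normalizations used in the paper.
import re

# Hypothetical domain normalizations (e.g. unify common medical shorthand).
ABBREV = {"b.p.": "blood pressure", "hr": "heart rate", "mg": "milligrams"}

def normalize(text: str) -> list[str]:
    text = text.lower()
    text = re.sub(r"[^\w\s.]", " ", text)            # drop most punctuation
    tokens = [ABBREV.get(t, t) for t in text.split()]
    return " ".join(tokens).replace(".", "").split()

def wer(ref: list[str], hyp: list[str]) -> float:
    # Standard Levenshtein-based WER: (S + D + I) / N.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,            # deletion
                          d[i][j - 1] + 1,            # insertion
                          d[i - 1][j - 1] + cost)     # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

def sem_wer(reference: str, hypothesis: str) -> float:
    return wer(normalize(reference), normalize(hypothesis))

if __name__ == "__main__":
    # "B.P." and "blood pressure" count as a match after normalization,
    # so the example below scores 0.0 rather than penalizing the ASR output.
    print(sem_wer("The B.P. was 120 over 80.",
                  "the blood pressure was 120 over 80"))
```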