Voice-based human-machine interaction is a primary modality for accessing intelligent systems, yet individuals with dysarthria face systematic exclusion because of gaps in recognition performance. Whilst automatic speech recognition (ASR) achieves word error rates (WER) below 5% on typical speech, performance degrades dramatically for dysarthric speakers. Multimodal large language models (MLLMs) could use contextual reasoning to compensate for acoustic degradation, yet their zero-shot capabilities on dysarthric speech remain uncharacterised. This study evaluates eight commercial speech-to-text services on the TORGO dysarthric speech corpus: four conventional ASR systems (AssemblyAI, Whisper large-v3, Deepgram Nova-3, Nova-3 Medical) and four MLLM-based systems (GPT-4o, GPT-4o Mini, Gemini 2.5 Pro, Gemini 2.5 Flash). The evaluation covers lexical accuracy, semantic preservation, and cost-latency trade-offs. Results show severity-dependent degradation: WER on mild dysarthria is 3-5%, approaching typical-speech benchmarks, whilst WER on severe dysarthria exceeds 49% for every system. A verbatim-transcription prompt has architecture-specific effects: GPT-4o achieves a 7.36 percentage-point WER reduction, with consistent improvement across all tested speakers, whilst the Gemini variants degrade. Semantic metrics indicate that communicative intent remains partially recoverable despite elevated lexical error rates. These findings establish empirical baselines that support evidence-based technology selection when deploying assistive voice interfaces.
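
For readers unfamiliar with the lexical metric reported above, the following is a minimal sketch of how WER is conventionally computed: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate`, the lowercasing and whitespace tokenisation, and the example sentences are illustrative assumptions; the abstract does not state the study's text normalisation or scoring tool.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: text is normalised by lowercasing and splitting on
    whitespace; the study's actual normalisation may differ.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical sentences, not taken from TORGO.
reference = "the quick brown fox"
baseline = word_error_rate(reference, "the quick fox")        # one deletion
prompted = word_error_rate(reference, "the quick brown fox")  # exact match

# A "percentage-point reduction" is the absolute difference of two WERs.
reduction_pp = (baseline - prompted) * 100
print(f"baseline={baseline:.2%} prompted={prompted:.2%} "
      f"reduction={reduction_pp:.2f} pp")
```

Because insertions count as errors, WER can exceed 100% when a system produces many more words than the reference contains, which is worth keeping in mind when reading error rates for severe dysarthria.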