While end-to-end (E2E) automatic speech recognition (ASR) models excel at general transcription, they struggle to recognize rare or unseen named entities (e.g., contact names, locations), which are critical for downstream applications such as virtual assistants. In this paper, we propose a contextual biasing method for attention-based encoder-decoder (AED) models that exploits a list of candidate named entities. Instead of predicting only the next token, we simultaneously predict multiple future tokens, enabling the model to "peek into the future" and score candidate entities from the provided list. Moreover, our approach leverages the multi-token prediction logits directly, without requiring additional entity encoders or cross-attention layers, significantly reducing architectural complexity. Experiments on LibriSpeech demonstrate that our approach achieves up to 50.34% relative improvement in named entity word error rate over the baseline AED model.
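The abstract leaves the scoring rule implicit. As a minimal sketch of how multi-token prediction logits might be used to score candidates, assume K prediction heads each emit logits over the vocabulary at the current decoding step, and that entity scores are length-normalized sums of token log-probabilities; the function name `score_entities` and these conventions are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def score_entities(mtp_logits: np.ndarray, entities: dict) -> dict:
    """Score candidate entities with multi-token prediction (MTP) logits.

    mtp_logits: array of shape (K, V) -- logits from K future-token
        prediction heads over a vocabulary of size V, all produced at
        the current decoding step (hypothetical layout).
    entities: mapping from entity name to its token-id sequence.

    Returns the average token log-probability of each entity, so that
    entities of different lengths remain comparable (an assumption).
    """
    K, _ = mtp_logits.shape
    # Normalize each head's logits into log-probabilities.
    log_probs = mtp_logits - np.logaddexp.reduce(
        mtp_logits, axis=-1, keepdims=True
    )

    scores = {}
    for name, token_ids in entities.items():
        # Only the first K tokens are visible to the K heads.
        span = token_ids[:K]
        # Head k predicts the token k steps ahead; accumulate its log-prob.
        total = sum(log_probs[k, tok] for k, tok in enumerate(span))
        scores[name] = total / len(span)
    return scores

# Toy example: 3 prediction heads over a vocabulary of 10 token ids.
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 10))
candidates = {"PARIS": [4, 7], "PARKER": [4, 2, 9]}
print(score_entities(logits, candidates))
```

The scores could then bias beam search toward high-scoring entities, consistent with the abstract's claim that no extra entity encoder or cross-attention layer is needed; how the scores are injected into decoding is left unspecified here.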