End-to-end speech recognition models trained with a joint Connectionist Temporal Classification (CTC)-Attention loss have gained popularity recently. In these models, a non-autoregressive CTC decoder is often used at inference time due to its speed and simplicity. However, such models are hard to personalize because of their conditional independence assumption, which prevents output tokens from previous time steps from influencing future predictions. To tackle this, we propose a novel two-way approach that first biases the encoder with attention over a predefined list of rare long-tail and out-of-vocabulary (OOV) words, and then uses dynamic boosting and a phone alignment network during decoding to further bias the subword predictions. We evaluate our approach on the open-source VoxPopuli and in-house medical datasets, showing a 60% improvement in F1 score on domain-specific rare words over a strong CTC baseline.
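The first stage of the proposed approach (attention biasing of the encoder over a bias-word list) can be sketched as a cross-attention step in which each encoder frame attends over embeddings of the predefined rare words and the resulting context vector is folded back into the frame. This is a minimal illustrative sketch, not the paper's implementation: the function names, the scaled dot-product scoring, and the additive combination are all assumptions, and a real model would also include a learned "no-bias" embedding so frames can ignore the list.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bias_encoder(enc, bias_emb):
    """Illustrative cross-attention biasing (assumed form, not the paper's).

    enc:      (T, d) encoder output frames
    bias_emb: (N, d) embeddings of the predefined rare/OOV bias words
    Returns biased encoder frames of the same shape (T, d).
    """
    d = enc.shape[-1]
    scores = enc @ bias_emb.T / np.sqrt(d)   # (T, N) frame-to-word scores
    attn = softmax(scores, axis=-1)          # attention over the bias list
    context = attn @ bias_emb                # (T, d) bias context per frame
    return enc + context                     # fold bias context into frames
```

In practice the bias embeddings would come from a trained phrase encoder and the combination would be learned jointly with the CTC-Attention loss; the additive merge above is only one plausible choice.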