End-to-end acoustic-to-word speech recognition models have recently gained popularity because they are easy to train, scale well to large amounts of training data, and do not require a lexicon. In addition, word models may be easier to integrate with downstream tasks such as spoken language understanding, because inference (search) is much simpler than with phoneme, character, or other sub-word units. In this paper, we describe methods to construct contextual acoustic word embeddings directly from a supervised sequence-to-sequence acoustic-to-word speech recognition model, using the learned attention distribution. On a suite of 16 standard sentence evaluation tasks, our embeddings perform competitively with a word2vec model trained on the speech transcriptions. In addition, we evaluate these embeddings on a spoken language understanding task and observe that they match the performance of text-based embeddings obtained by first performing speech recognition and then constructing word embeddings from the transcriptions.
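As a rough illustration of the core idea (not the paper's actual implementation), the sketch below shows one natural way to pool a trained acoustic-to-word model's encoder states into per-word embeddings using the decoder's attention distribution. It assumes the model exposes encoder hidden states and per-word attention weights; all names and shapes here are hypothetical stand-ins.

```python
import numpy as np

def acoustic_word_embeddings(encoder_states, attention):
    """Pool encoder states into one embedding per decoded word.

    encoder_states : (T, d) array of encoder hidden states over T frames.
    attention      : (U, T) array; row u is the attention distribution
                     the decoder placed over frames when emitting word u.
    Returns a (U, d) array: one contextual acoustic embedding per word.
    """
    # Each word embedding is the attention-weighted average of the
    # encoder states, so the frames the model attended to dominate.
    return attention @ encoder_states

# Toy usage with random stand-ins for a trained model's outputs.
T, d, U = 50, 256, 4                       # frames, state dim, decoded words
rng = np.random.default_rng(0)
enc = rng.normal(size=(T, d))              # hypothetical encoder states
att = rng.random(size=(U, T))
att /= att.sum(axis=1, keepdims=True)      # rows sum to 1 (softmax-like)
emb = acoustic_word_embeddings(enc, att)
print(emb.shape)                           # (4, 256)
```

Because the attention weights differ for each occurrence of a word, the resulting embeddings are contextual: the same word yields different vectors in different utterances.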