Speech translation has traditionally been approached through cascaded models consisting of a speech recognizer trained on a corpus of transcribed speech, and a machine translation system trained on parallel texts. Several recent works have shown the feasibility of collapsing the cascade into a single, direct model that can be trained in an end-to-end fashion on a corpus of translated speech. However, experiments are inconclusive on whether the cascade or the direct model is stronger, and have only been conducted under the unrealistic assumption that both are trained on equal amounts of data, ignoring other available speech recognition and machine translation corpora. In this paper, we demonstrate that direct speech translation models require more data to perform well than cascaded models, and while they allow including auxiliary data through multi-task training, they are poor at exploiting such data, putting them at a severe disadvantage. As a remedy, we propose the use of end-to-end trainable models with two attention mechanisms, the first establishing source-speech-to-source-text alignments, the second modeling source-to-target text alignments. We show that such models naturally decompose into multi-task-trainable recognition and translation sub-tasks, and propose an attention-passing technique that alleviates error propagation issues in a previous formulation of a model with two attention stages. Our proposed model outperforms all examined baselines and is able to exploit auxiliary training data much more effectively than direct attentional models.
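The two-attention-stage design can be illustrated with a minimal NumPy sketch. This is an assumption-laden toy, not the paper's implementation: random vectors stand in for learned encoder/decoder states, and simple dot-product attention stands in for the trained attention layers. The key point it shows is attention-passing: the second (translation) stage consumes the continuous context vectors produced by the first (recognition) stage, rather than discrete transcript tokens, which is what alleviates error propagation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, keys, values):
    # Dot-product attention: score one query vector against a sequence.
    scores = keys @ query                  # (T,)
    weights = softmax(scores)              # attention distribution over T
    return weights @ values                # weighted sum, shape (d,)

# Toy dimensions (hypothetical, chosen for illustration only).
d = 8            # shared hidden size
T_speech = 50    # encoded speech frames
T_src = 10       # source-text (transcription) decoding steps
T_tgt = 12       # target-text (translation) decoding steps

rng = np.random.default_rng(0)
speech_enc = rng.normal(size=(T_speech, d))   # stand-in speech encoder states

# Stage 1: the recognition decoder attends over speech frames.
# Attention-passing keeps the per-step context vectors for the next stage.
contexts = np.stack([
    attend(rng.normal(size=d), speech_enc, speech_enc)
    for _ in range(T_src)
])                                            # (T_src, d)

# Stage 2: the translation decoder attends over the passed contexts,
# never over discrete transcript tokens, so recognition errors are not
# hardened into the translation input.
outputs = np.stack([
    attend(rng.normal(size=d), contexts, contexts)
    for _ in range(T_tgt)
])                                            # (T_tgt, d)

print(outputs.shape)
```

Because each stage has its own attention and its own decoder states, the recognition and translation sub-tasks can be trained jointly end-to-end while also accepting auxiliary ASR or MT data through multi-task objectives.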