Recently, end-to-end speech translation (ST) has gained significant attention because it avoids error propagation. However, the approach suffers from data scarcity: it depends heavily on direct ST data and is less efficient at exploiting speech transcription and text translation data, which are often more readily available. In the related field of multilingual text translation, several techniques have been proposed for zero-shot translation. A central idea is to increase the similarity of representations of semantically similar sentences in different languages. We investigate whether these ideas can be applied to speech translation by building ST models trained on speech transcription and text translation data. We study the effects of data augmentation and auxiliary loss functions. The techniques were successfully applied to few-shot ST using limited ST data, with improvements of up to +12.9 BLEU points over direct end-to-end ST and +3.1 BLEU points over ST models fine-tuned from an ASR model.
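The auxiliary loss idea mentioned above, pulling together representations of semantically equivalent inputs across modalities or languages, can be sketched as a distance penalty between pooled encoder outputs. This is an illustrative sketch only, not the paper's exact formulation: the mean pooling and squared-L2 distance used here are assumptions, and `auxiliary_similarity_loss` is a hypothetical helper name.

```python
import numpy as np

def auxiliary_similarity_loss(speech_states, text_states):
    """Illustrative auxiliary loss (assumed form, not the paper's exact one):
    squared L2 distance between mean-pooled speech and text encoder states.

    speech_states: (T_speech, d) array of speech encoder outputs
    text_states:   (T_text, d) array of text encoder outputs
    """
    speech_vec = speech_states.mean(axis=0)  # pool over time frames
    text_vec = text_states.mean(axis=0)      # pool over tokens
    return float(np.sum((speech_vec - text_vec) ** 2))

# Identical pooled representations incur zero penalty; mismatched ones are pushed together.
rng = np.random.default_rng(0)
h = rng.standard_normal((10, 4))
g = rng.standard_normal((7, 4))
print(auxiliary_similarity_loss(h, h))  # 0.0
print(auxiliary_similarity_loss(h, g) > 0.0)  # True
```

In training, such a term would typically be added to the main translation loss with a weighting coefficient, so that the encoder is encouraged to map transcribed speech and its text counterpart to nearby points in representation space.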