We present a corpus of sentence-aligned triples of German audio, German text, and English translation, built from German audio books. The corpus comprises over 100 hours of audio and over 50k parallel sentences. The audio is read speech and therefore low in disfluencies. A manual evaluation of the audio and sentence alignments shows that speech alignment quality is in general very high. Sentence alignment quality is comparable to that of widely used parallel translation data and can be adjusted by applying cutoffs on the automatic alignment score. To our knowledge, this corpus is to date the largest resource for end-to-end speech translation for German.
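The trade-off between corpus size and alignment quality mentioned above can be illustrated with a minimal sketch of score-based filtering. The record layout (`AlignedTriple` and its field names), the function `filter_by_alignment_score`, and the convention that a higher score means a better alignment are all assumptions made for illustration; the abstract does not specify the corpus file format or the score scale.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class AlignedTriple:
    """One corpus entry. Field names are hypothetical, not the corpus schema."""
    audio_path: str         # path to the German audio segment
    german_text: str        # German transcript of the segment
    english_text: str       # aligned English translation
    alignment_score: float  # automatic sentence alignment score (assumed: higher is better)


def filter_by_alignment_score(
    triples: Iterable[AlignedTriple], cutoff: float
) -> List[AlignedTriple]:
    """Keep only the triples whose alignment score is at least `cutoff`."""
    return [t for t in triples if t.alignment_score >= cutoff]


if __name__ == "__main__":
    # Toy data with made-up scores, only to show the effect of the cutoff.
    sample = [
        AlignedTriple("a.wav", "Guten Morgen.", "Good morning.", 0.9),
        AlignedTriple("b.wav", "Er ging nach Hause.", "He went home.", 0.4),
        AlignedTriple("c.wav", "Es regnete.", "The sun was shining.", 0.1),
    ]
    for cutoff in (0.0, 0.3, 0.5):
        kept = filter_by_alignment_score(sample, cutoff)
        print(f"cutoff={cutoff}: kept {len(kept)} of {len(sample)} triples")
```

Raising the cutoff yields a smaller but cleaner subset, while lowering it keeps more data at the cost of noisier sentence pairs. If the actual score is a cost (lower is better), the comparison would need to be reversed.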