This paper presents KIT's submissions to the IWSLT 2025 low-resource track. We develop both cascaded systems, consisting of Automatic Speech Recognition (ASR) and Machine Translation (MT) models, and end-to-end (E2E) Speech Translation (ST) systems for three language pairs, translating from Bemba, North Levantine Arabic, and Tunisian Arabic into English. Building upon pre-trained models, we fine-tune our systems with different strategies to utilize resources efficiently. This study further explores system enhancement with synthetic data and model regularization. Specifically, we investigate MT-augmented ST by generating translations from ASR data using MT models. For North Levantine, which lacks parallel ST training data, a system trained solely on synthetic data slightly surpasses the cascaded system trained on real data. We also explore augmentation using text-to-speech models by generating synthetic speech from MT data, demonstrating the benefits of synthetic data in improving both ASR and ST performance for Bemba. Additionally, we apply intra-distillation to enhance model performance. Our experiments show that this approach consistently improves results across ASR, MT, and ST tasks, as well as across different pre-trained models. Finally, we apply Minimum Bayes Risk decoding to combine the cascaded and end-to-end systems, achieving an improvement of approximately 1.5 BLEU points.
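The system-combination step above selects, from the pooled hypotheses of both systems, the candidate with the highest expected utility against the others. A minimal sketch of this MBR selection, using a simple token-level F1 as a stand-in utility (the paper's actual utility metric is not specified here, and the example sentences are invented for illustration):

```python
from collections import Counter


def token_f1(hyp: str, ref: str) -> float:
    # Toy utility: F1 over whitespace tokens (placeholder for a real
    # metric such as chrF or neural utility functions).
    h, r = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((h & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(h.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def mbr_select(candidates: list[str]) -> str:
    # Score each candidate by its average utility against all other
    # candidates (treated as pseudo-references); the candidate with the
    # highest expected utility is returned.
    best, best_score = None, float("-inf")
    for hyp in candidates:
        others = [c for c in candidates if c is not hyp]
        score = sum(token_f1(hyp, ref) for ref in others) / max(len(others), 1)
        if score > best_score:
            best, best_score = hyp, score
    return best


# Hypothetical hypotheses pooled from a cascaded and an E2E system.
cascaded_hyps = ["the farmer planted maize", "the farmer plants maize"]
e2e_hyps = ["a farmer planted maize", "farmer planted the corn"]
print(mbr_select(cascaded_hyps + e2e_hyps))  # → the farmer planted maize
```

The candidate agreed upon by most hypotheses wins, which is why pooling outputs from two complementary systems can outperform either one alone.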