In this work, we seek to build effective code-switched (CS) automatic speech recognition (ASR) systems under the zero-shot setting, where no transcribed CS speech data is available for training. Previously proposed frameworks that conditionally factorize the bilingual task into its constituent monolingual parts are a promising starting point for leveraging monolingual data efficiently. However, these methods require the monolingual modules to perform language segmentation: each monolingual module must simultaneously detect CS points and transcribe speech segments of one language while ignoring those of other languages -- not a trivial task. We propose to simplify each monolingual module by allowing it to transcribe all speech segments indiscriminately with a monolingual script (i.e., transliteration). This simple modification passes the responsibility of CS point detection to subsequent bilingual modules, which determine the final output by considering multiple monolingual transliterations along with external language model information. We apply this transliteration-based approach in an end-to-end differentiable neural network and demonstrate its efficacy for zero-shot CS ASR on the Mandarin-English SEAME test sets.