In this work, we compare from-scratch sequence-level cross-entropy (full-sum) training of Hidden Markov Model (HMM) and Connectionist Temporal Classification (CTC) topologies for automatic speech recognition (ASR). Besides accuracy, we analyze their capability to generate high-quality time alignments between the speech signal and the transcription, which can be crucial for many downstream applications. Moreover, we propose several methods to improve the convergence of from-scratch full-sum training by addressing the alignment modeling issue. A systematic comparison is conducted on both the Switchboard and LibriSpeech corpora across CTC, posterior HMM with and without transition probabilities, and standard hybrid HMM. We also provide a detailed analysis of both Viterbi forced alignment and Baum-Welch full-sum occupation probabilities.