Streaming recognition of multi-talker conversations has so far been evaluated only for 2-speaker single-turn sessions. In this paper, we investigate it for multi-turn meetings containing multiple speakers using the Streaming Unmixing and Recognition Transducer (SURT) model, and show that naively extending the single-turn model to this harder setting incurs a performance penalty. As a solution, we propose the dual-path (DP) modeling strategy first used for time-domain speech separation. We experiment with LSTM and Transformer based DP models, and show that they improve word error rate (WER) performance while yielding faster convergence. We also explore training strategies such as chunk width randomization and curriculum learning for these models, and demonstrate their importance through ablation studies. Finally, we evaluate our models on the LibriCSS meeting data, where they perform competitively with offline separation-based methods.
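The dual-path strategy mentioned above works by folding a long sequence into overlapping chunks, so that one layer can model intra-chunk (local) context and another inter-chunk (global) context. The following is a minimal sketch of just the chunking and overlap-add merging step, under illustrative assumptions: a 50% hop, zero-padding, and averaging on merge; the actual intra/inter-chunk LSTM or Transformer layers from the paper are omitted.

```python
import numpy as np

def segment(x, K):
    """Fold a sequence x of shape [T, F] into chunks [C, K, F] with 50% overlap.

    K is the chunk width (hop = K // 2 is an assumption; papers vary).
    Returns the chunk tensor and the original length T for later merging.
    """
    T, F = x.shape
    hop = K // 2
    n_chunks = int(np.ceil(max(T - K, 0) / hop)) + 1
    pad = (n_chunks - 1) * hop + K - T          # zero-pad so chunks tile T
    xp = np.pad(x, ((0, pad), (0, 0)))
    chunks = np.stack([xp[i * hop: i * hop + K] for i in range(n_chunks)])
    return chunks, T

def merge(chunks, T):
    """Overlap-add chunks [C, K, F] back to [T, F], averaging overlaps."""
    C, K, F = chunks.shape
    hop = K // 2
    out = np.zeros((hop * (C - 1) + K, F))
    count = np.zeros((hop * (C - 1) + K, 1))
    for i, ch in enumerate(chunks):
        out[i * hop: i * hop + K] += ch
        count[i * hop: i * hop + K] += 1.0
    return (out / count)[:T]                    # trim padding

# A DP block would apply an intra-chunk layer along axis 1 (length K)
# and an inter-chunk layer along axis 0 (length C) between these calls.
x = np.arange(20, dtype=float).reshape(10, 2)
chunks, T = segment(x, K=4)
y = merge(chunks, T)
assert np.allclose(y, x)   # identity processing round-trips exactly
```

Chunk width randomization, one of the training strategies studied in the paper, would correspond to sampling `K` per batch rather than fixing it.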