In this paper, we propose a fully supervised speaker diarization approach, named unbounded interleaved-state recurrent neural networks (UIS-RNN). Given extracted speaker-discriminative embeddings (a.k.a. d-vectors) from input utterances, each individual speaker is modeled by a parameter-sharing RNN, while the RNN states for different speakers interleave in the time domain. This RNN is naturally integrated with a distance-dependent Chinese restaurant process (ddCRP) to accommodate an unknown number of speakers. Our system is fully supervised and is able to learn from examples where time-stamped speaker labels are annotated. We achieved a 7.6% diarization error rate on NIST SRE 2000 CALLHOME, which is better than the state-of-the-art method using spectral clustering. Moreover, our method decodes in an online fashion while most state-of-the-art systems rely on offline clustering.
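To make the online decoding idea concrete, below is a minimal sketch of a CRP-style greedy assignment over d-vectors: each incoming embedding is either attached to an existing speaker (with prior mass growing with that speaker's segment count, combined with similarity to a running centroid) or opens a new speaker (prior mass proportional to a concentration parameter). This is a simplified illustration of the Chinese-restaurant-process assignment principle, not the UIS-RNN algorithm itself; `alpha` and `threshold` are hypothetical hyperparameters, and the similarity-weighted scoring is an assumption for the sketch.

```python
import numpy as np

def crp_online_assign(embeddings, alpha=1.0, threshold=0.5):
    """Greedy online speaker assignment with a CRP-style prior.

    Simplified sketch only: real UIS-RNN models each speaker with a
    parameter-sharing RNN; here a running centroid stands in for it.
    `alpha` (new-speaker concentration) and `threshold` are
    illustrative values, not taken from the paper.
    """
    labels = []
    centroids = []  # running mean d-vector per speaker
    counts = []     # number of segments assigned to each speaker
    for x in embeddings:
        x = x / np.linalg.norm(x)
        if not centroids:
            centroids.append(x.copy())
            counts.append(1)
            labels.append(0)
            continue
        # Score each existing speaker: CRP prior (count) times cosine similarity.
        sims = [c @ x / np.linalg.norm(c) for c in centroids]
        scores = [n * max(s, 1e-6) for n, s in zip(counts, sims)]
        new_score = alpha * threshold  # prior mass for opening a new speaker
        k = int(np.argmax(scores))
        if scores[k] >= new_score:
            # Attach to the best existing speaker and update its centroid.
            centroids[k] = (counts[k] * centroids[k] + x) / (counts[k] + 1)
            counts[k] += 1
            labels.append(k)
        else:
            # Open a new speaker, as the (dd)CRP allows an unbounded number.
            centroids.append(x.copy())
            counts.append(1)
            labels.append(len(centroids) - 1)
    return labels
```

Because assignments are made segment by segment, this decodes online, mirroring the property the abstract contrasts with offline clustering systems.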