C3-DINO: Joint Contrastive and Non-contrastive Self-Supervised Learning for Speaker Verification

Self-supervised learning (SSL) has drawn increasing attention in the field of speech processing. Recent studies have demonstrated that contrastive learning can learn discriminative speaker embeddings in a self-supervised manner. However, basic contrastive self-supervised learning (CSSL) assumes that every pair formed from a view of the anchor instance and any view of another instance is negative, which introduces many false negative pairs when constructing the loss function. This problem, referred to as class-collision, remains a major issue that keeps CSSL-based speaker verification (SV) systems from achieving better performance. Meanwhile, studies reveal that negative-sample-free SSL frameworks perform well in learning speaker or image representations. In this study, we investigate SSL techniques that lead to improved SV performance. We first analyse the impact of false negative pairs in CSSL systems. We then propose a multi-stage Class-Collision Correction (C3) method, which leads to a state-of-the-art CSSL-based speaker embedding system. Starting from the pretrained CSSL model, we further propose to fine-tune the speaker embedding network with a negative-sample-free SSL objective (i.e., DINO). The resulting speaker embedding system (C3-DINO) achieves a 2.5% EER with simple cosine distance scoring (CDS) on the VoxCeleb1 test set, which outperforms the previous SOTA SSL system (4.86%) by a significant +45% relative improvement. With speaker clustering and pseudo-labeling on the VoxCeleb2 training set, an LDA/CDS back-end applied to the C3-DINO speaker embeddings further reduces the EER to 2.2%. Comprehensive experiments on the VoxCeleb benchmarks and our internal dataset demonstrate the effectiveness of the proposed methods, and the performance gap between SSL-based SV and its supervised counterpart narrows further.
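To make the class-collision problem concrete, the following is a minimal sketch, not the authors' C3 method. It assumes an InfoNCE-style contrastive loss over cosine similarities, a toy batch of hand-written 2-D embeddings, and oracle speaker labels that a real SSL system would not have; the function names `false_negative_pairs` and `contrastive_loss` and the `skip` parameter are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def false_negative_pairs(speaker_ids):
    """Return all (i, j), i != j, where utterances i and j share a speaker.

    A basic contrastive loss treats each such pair as negative. The labels
    are an oracle used for illustration only (assumption): SSL has none.
    """
    n = len(speaker_ids)
    return {(i, j) for i in range(n) for j in range(n)
            if i != j and speaker_ids[i] == speaker_ids[j]}


def contrastive_loss(view_a, view_b, temperature=0.1, skip=frozenset()):
    """InfoNCE-style loss averaged over anchors (illustrative sketch).

    view_a[i] and view_b[i] are embeddings of two views of utterance i.
    Pairs (i, j) listed in `skip` are dropped from the set of negatives.
    """
    n = len(view_a)
    total = 0.0
    for i in range(n):
        # Keep the positive (j == i) and every negative not listed in `skip`.
        logits = [cosine(view_a[i], view_b[j]) / temperature
                  for j in range(n) if j == i or (i, j) not in skip]
        positive = cosine(view_a[i], view_b[i]) / temperature
        # Numerically stable log-sum-exp over the kept logits.
        peak = max(logits)
        log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_sum - positive
    return total / n


# Toy batch: utterances 0 and 2 come from the same (hidden) speaker.
speaker_ids = ["spk_a", "spk_b", "spk_a"]
view_a = [[1.0, 0.1], [0.0, 1.0], [0.9, 0.2]]
view_b = [[0.9, 0.0], [0.1, 1.0], [1.0, 0.3]]

collisions = false_negative_pairs(speaker_ids)
loss_all_negative = contrastive_loss(view_a, view_b)
loss_collisions_removed = contrastive_loss(view_a, view_b, skip=collisions)
print(sorted(collisions))
print(loss_all_negative, loss_collisions_removed)
```

In this toy batch, the same-speaker pairs contribute large terms to the denominator of the loss, so the anchor is penalised for being close to an utterance of its own speaker. Dropping those pairs lowers the loss. How such pairs are detected without labels is exactly what a correction method has to solve, and this sketch does not attempt it.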
