We describe the Phonexia submission to the VoxCeleb Speaker Recognition Challenge 2021 (VoxSRC-21) in the unsupervised speaker verification track. Our solution was very similar to IDLab's winning submission for VoxSRC-20. An embedding extractor was first bootstrapped using momentum contrastive learning, with input augmentations serving as the only source of supervision. This was followed by several iterations of clustering to assign pseudo-speaker labels, which were then used for supervised training of the embedding extractor. Finally, score fusion was performed by averaging the zt-normalized cosine scores of five different embedding extractors. We also briefly describe unsuccessful alternatives involving i-vectors instead of DNN embeddings and PLDA instead of cosine scoring.
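The final fusion step above can be sketched as follows. This is a minimal, simplified illustration (not the authors' exact pipeline): trial scores are cosine similarities between embeddings, each system's scores are zt-normalized against a cohort, and the per-system normalized scores are averaged. The function names, the cohort handling, and the use of a single cohort for both the z- and t-statistics are simplifying assumptions; in practice the z- and t-cohorts are typically separate sets.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def zt_norm(score, enroll, test, cohort):
    """Simplified zt-norm of a single trial score (illustrative only).

    z-norm: normalize the trial score with statistics of the enrollment
    embedding scored against the cohort; then t-norm: normalize with
    statistics of the (z-normalized) cohort-vs-test scores.
    """
    z = np.array([cosine(enroll, c) for c in cohort])
    s_z = (score - z.mean()) / (z.std() + 1e-12)
    t = np.array([cosine(c, test) for c in cohort])
    # Simplification: reuse the same z statistics for the t-cohort scores.
    t_z = (t - z.mean()) / (z.std() + 1e-12)
    return float((s_z - t_z.mean()) / (t_z.std() + 1e-12))

def fuse(system_scores):
    """Fuse by averaging the per-system zt-normalized scores."""
    return float(np.mean(system_scores))
```

For each trial, every one of the five systems would produce a `zt_norm` score, and `fuse` would average them into the submitted score.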