ECAPA-TDNN is currently the most popular TDNN-series model for speaker verification and set a new state of the art (SOTA) among TDNN models. However, its one-dimensional convolutions have a global receptive field over the feature channel, which destroys the time-frequency relevance of the spectrogram. Moreover, ECAPA-TDNN has only five layers; this structure, much shallower than ResNet, restricts its capability to generate deep representations. To further improve ECAPA-TDNN, we first propose a progressive channel fusion strategy that splits the spectrogram across the feature channel and gradually expands the receptive field through the network. Second, we enlarge the model by extending its depth and adding branches. Our proposed model achieves an EER of 0.718 and a minDCF(0.01) of 0.0858 on Vox1-O, relative improvements of 16.1\% and 19.5\% over ECAPA-TDNN-large.
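The progressive channel fusion idea can be illustrated with a small sketch. It is not the authors' implementation: the number of frequency bins (80) and the group schedule (8, 4, 2, 1) are assumptions chosen for illustration, as are the helper names group_bands and receptive_field_per_stage. The sketch only shows how splitting the feature channel into contiguous groups, and then merging groups stage by stage, widens the frequency span that a single group can see.

```python
def group_bands(n_bins, n_groups):
    """Split n_bins frequency bins into n_groups contiguous, equal-width bands.

    Returns a list of (start, end) index pairs, end exclusive.
    """
    if n_groups <= 0 or n_bins % n_groups != 0:
        raise ValueError("n_bins must be divisible by a positive n_groups")
    width = n_bins // n_groups
    return [(g * width, (g + 1) * width) for g in range(n_groups)]


def receptive_field_per_stage(n_bins, group_schedule):
    """Number of frequency bins visible to one group at each stage.

    Assumes the groups are contiguous and nested, so halving the number
    of groups doubles the frequency span each group covers.
    """
    return [n_bins // groups for groups in group_schedule]


# Hypothetical settings, not taken from the paper.
N_BINS = 80
GROUP_SCHEDULE = [8, 4, 2, 1]

if __name__ == "__main__":
    spans = receptive_field_per_stage(N_BINS, GROUP_SCHEDULE)
    for stage, (groups, span) in enumerate(zip(GROUP_SCHEDULE, spans), start=1):
        first_band = group_bands(N_BINS, groups)[0]
        print(f"stage {stage}: {groups} group(s), {span} bins each, "
              f"first band covers bins {first_band[0]}..{first_band[1] - 1}")
```

In a neural network this grouping would typically correspond to a grouped one-dimensional convolution whose group count shrinks with depth; with a single group the layer reduces to an ordinary one-dimensional convolution with a global receptive field over the feature channel.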