Automatic speaker verification (ASV) is widely used in real-world identity authentication. However, with the rapid development of voice conversion and speech synthesis algorithms, ASV systems are vulnerable to spoofing attacks. In recent years, many works have addressed synthetic speech detection, and researchers have proposed a number of anti-spoofing methods based on hand-crafted features to improve the detection accuracy and robustness of ASV systems. However, using hand-crafted features rather than the raw waveform loses information that is useful for anti-spoofing, which reduces detection performance. Inspired by the promising performance of ConvNeXt in image classification tasks, we revise the ConvNeXt network architecture for the spoofing attack detection task and propose a lightweight end-to-end anti-spoofing model. By integrating the revised architecture with a channel attention block and using the focal loss function, the proposed model can focus on the most informative sub-bands of the speech representation and on difficult samples that are hard to classify, which improves anti-spoofing performance. Experiments show that our best single system achieves an equal error rate of 0.75% and a min-tDCF of 0.0212 on the ASVSpoof2019 LA evaluation dataset, outperforming state-of-the-art systems.
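To illustrate how a focal loss shifts training emphasis toward hard samples, the following is a minimal sketch of the standard binary focal loss, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t). The function name `focal_loss`, the label convention (1 = bona fide, 0 = spoof), and the values gamma = 2.0 and alpha = 0.25 are assumptions chosen for illustration; the abstract does not state the hyperparameters or the exact formulation used in the proposed model.

```python
import math


def focal_loss(p_bonafide, label, gamma=2.0, alpha=0.25):
    """Binary focal loss for a single utterance (illustrative sketch).

    p_bonafide: predicted probability that the utterance is bona fide.
    label:      1 for bona fide, 0 for spoof (assumed convention).
    gamma:      focusing parameter (assumed value, not from the paper).
    alpha:      class-balancing weight for label 1 (assumed value).
    """
    # Probability the model assigns to the true class.
    p_t = p_bonafide if label == 1 else 1.0 - p_bonafide
    alpha_t = alpha if label == 1 else 1.0 - alpha
    # Clamp to avoid log(0).
    p_t = min(max(p_t, 1e-12), 1.0)
    # The factor (1 - p_t)^gamma shrinks the loss of well-classified samples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A confidently correct sample versus a misclassified one, same label.
easy = focal_loss(0.95, 1)
hard = focal_loss(0.30, 1)
print(f"easy sample loss: {easy:.6f}")
print(f"hard sample loss: {hard:.6f}")
```

With gamma = 0 and alpha = 0.5 the expression reduces to half the ordinary cross-entropy, so the focusing effect comes entirely from the (1 - p_t)^gamma factor.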