Detecting spoofed utterances is a fundamental problem in voice-based biometrics. Spoofing can be carried out through logical access attacks, such as speech synthesis and voice conversion, or through physical access attacks, such as replaying a pre-recorded utterance. Inspired by the state-of-the-art \emph{x}-vector based speaker verification approach, this paper proposes a time-delay shallow neural network (TD-SNN) for spoof detection under both logical and physical access. The novelty of the proposed TD-SNN system compared with conventional DNN systems is that it can handle variable-length utterances during testing. The performance of the proposed TD-SNN systems and of the baseline Gaussian mixture model (GMM) systems is analyzed on the ASVspoof 2019 dataset and measured in terms of the minimum normalized tandem detection cost function (min-t-DCF). When studied with individual features, the TD-SNN system consistently outperforms the GMM system for physical access. For logical access, the GMM system surpasses the TD-SNN system for certain individual features. When combined with the decision-level feature switching (DLFS) paradigm, the best TD-SNN system outperforms the best baseline GMM system on the evaluation data, with relative improvements of 48.03\% and 49.47\% for logical and physical access, respectively.
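To illustrate how a network in the \emph{x}-vector family can accept utterances of any length, the sketch below shows statistics pooling, the step that \emph{x}-vector systems use to collapse a variable number of frame-level vectors into one fixed-length vector. This is a minimal sketch under assumptions: the abstract does not describe the TD-SNN pooling layer, so the function name `statistics_pooling`, the mean-plus-standard-deviation choice, and the toy inputs are all hypothetical and not taken from the paper.

```python
import math
from typing import List, Sequence


def statistics_pooling(frames: Sequence[Sequence[float]]) -> List[float]:
    """Collapse (num_frames x dim) frame vectors into one vector of length 2 * dim.

    Hypothetical helper: per-dimension mean followed by per-dimension
    (population) standard deviation, as in x-vector style pooling.
    """
    if not frames:
        raise ValueError("need at least one frame")
    num_frames = len(frames)
    dim = len(frames[0])
    # Mean of each feature dimension over time.
    means = [sum(f[d] for f in frames) / num_frames for d in range(dim)]
    # Standard deviation of each feature dimension over time.
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / num_frames)
        for d in range(dim)
    ]
    return means + stds


# Toy frame-level outputs for two utterances of different lengths (assumed data).
short_utt = [[0.0, 1.0], [2.0, 3.0]]
long_utt = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]

short_vec = statistics_pooling(short_utt)
long_vec = statistics_pooling(long_utt)
print(len(short_vec), len(long_vec))
```

Both utterances map to a vector of the same length regardless of how many frames they contain, so any layers placed after the pooling step see a fixed-size input at test time.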