Deepfake audio detection has progressed rapidly with strong pre-trained encoders (e.g., WavLM, Wav2Vec2, MMS). However, performance under realistic capture conditions, such as background noise (domestic, office, transport), room reverberation, and consumer channels, often lags behind clean-lab results. We survey and evaluate the robustness of state-of-the-art audio deepfake detection models and present a reproducible framework that mixes MS-SNSD noise recordings with ASVspoof 2021 DF utterances at controlled signal-to-noise ratios (SNRs). SNR is a widely used, measurable proxy for noise severity in speech processing; it lets us sweep from near-clean (35 dB) to very noisy (-5 dB) conditions and quantify how gracefully performance degrades. We study multi-condition training and fixed-SNR testing for the pre-trained encoders (WavLM, Wav2Vec2, MMS), reporting accuracy, ROC-AUC, and EER on binary and four-class (authenticity × corruption) tasks. In our experiments, fine-tuning reduces EER by 10-15 percentage points at 10 to 0 dB SNR across backbones.
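For concreteness, SNR-controlled mixing can be sketched as below. This is a minimal illustration, not the framework's exact implementation: the function name `mix_at_snr` and the power-based gain computation are our assumptions, and mono waveforms at a shared sample rate are assumed.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add `noise` to `speech` scaled so the speech-to-noise power ratio is `snr_db`.

    Hypothetical helper: assumes mono float waveforms at the same sample rate.
    """
    # Tile or truncate the noise to match the utterance length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    # Average signal powers (small epsilon guards against silent clips).
    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12

    # Gain that brings the noise to the desired power relative to the speech:
    # 10 * log10(speech_power / (gain**2 * noise_power)) == snr_db
    target_noise_power = speech_power / (10.0 ** (snr_db / 10.0))
    gain = np.sqrt(target_noise_power / noise_power)
    return speech + gain * noise
```

Sweeping `snr_db` from 35 down to -5 with a routine like this yields the fixed-SNR test conditions described above.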