Previous databases have been designed to further the development of fake audio detection. However, the fake utterances in them are mostly generated by altering the timbre, prosody, linguistic content, or channel noise of the original audio. These databases ignore a scenario in which the attacker manipulates the acoustic scene of the original audio, replacing it with a forged one. Such manipulated audio, if misused for malicious purposes, would pose a major threat to society. This motivates us to fill the gap. This paper therefore presents a dataset for scene fake audio detection, named SceneFake. A manipulated audio in the SceneFake dataset is generated by tampering only with the acoustic scene of an utterance, using speech enhancement technologies. With this dataset, we can not only detect fake utterances on a seen test set but also evaluate the generalization of fake detection models to unseen manipulation attacks. Benchmark results on the SceneFake dataset are reported. In addition, an analysis of fake attacks with different speech enhancement technologies and signal-to-noise ratios is presented. The results show that scene-manipulated utterances cannot be reliably detected by the existing baseline models of ASVspoof 2019, and that the detection of unseen scene manipulations remains challenging.
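As an illustration only (not the dataset's exact generation pipeline), a scene manipulation can be thought of as two steps: suppress the original background with a speech enhancement model, then add the forged scene's noise at a chosen signal-to-noise ratio. The SNR-controlled mixing step can be sketched in NumPy as follows; all function and variable names here are hypothetical:

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, scene_noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Impose a new acoustic scene on (enhanced) speech at a target SNR in dB."""
    # Tile or truncate the scene noise to match the speech length.
    if len(scene_noise) < len(speech):
        reps = int(np.ceil(len(speech) / len(scene_noise)))
        scene_noise = np.tile(scene_noise, reps)
    scene_noise = scene_noise[:len(speech)]

    # Scale the noise so that 10 * log10(P_speech / P_noise) == snr_db.
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(scene_noise ** 2) + 1e-12  # guard against silence
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * scene_noise

# Toy example: a sine-tone "utterance" and white "scene" noise mixed at 10 dB SNR.
rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
noise = rng.standard_normal(8000)
fake = mix_at_snr(speech, noise, snr_db=10.0)
```

Varying `snr_db` here mirrors the abstract's analysis of attacks at different signal-to-noise ratios: the lower the SNR, the more prominent the forged scene becomes relative to the speech.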