As audio-visual systems are deployed for safety-critical tasks such as surveillance and malicious content filtering, their robustness remains an under-studied area. Existing published work on robustness either does not scale to large datasets or does not handle multiple modalities. This work studies several key questions about multi-modal learning through the lens of robustness: 1) Are multi-modal models necessarily more robust than uni-modal models? 2) How can the robustness of multi-modal learning be measured efficiently? 3) How should different modalities be fused to achieve a more robust multi-modal model? To understand the robustness of multi-modal models in a large-scale setting, we propose a density-based metric and a convexity metric that efficiently characterize the distribution of each modality in a high-dimensional latent space. Our work provides theoretical intuition, together with empirical evidence, showing how multi-modal fusion affects adversarial robustness through these metrics. We further devise a mix-up strategy based on our metrics to improve the robustness of the trained model. Our experiments on AudioSet and Kinetics-Sounds verify our hypothesis that multi-modal models are not necessarily more robust than their uni-modal counterparts in the face of adversarial examples. We also observe that our mix-up-trained models achieve as much protection as traditional adversarial training, offering a computationally cheap alternative. Implementation: https://github.com/lijuncheng16/AudioSetDoneRight
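The mix-up strategy mentioned above builds on the standard mixup recipe (Zhang et al., 2018): training on convex combinations of pairs of inputs and their labels. The abstract does not specify how the density and convexity metrics guide pair selection, so the sketch below shows only the generic mixing step; the function name `mixup_batch` and the parameter `alpha` are illustrative assumptions, not the paper's API.

```python
import numpy as np

def mixup_batch(x_a, x_b, y_a, y_b, alpha=0.4, rng=None):
    """Generic mixup: convex combination of two examples and their labels.

    This is a sketch of the standard recipe; the paper's metric-guided
    variant (e.g. choosing which pairs to mix) is not reproduced here.
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)          # mixing coefficient in [0, 1]
    x = lam * x_a + (1.0 - lam) * x_b     # mixed input (e.g. audio/video features)
    y = lam * y_a + (1.0 - lam) * y_b     # mixed soft label
    return x, y, lam
```

Because the mixed label stays a valid probability distribution, the same cross-entropy loss used for clean training applies unchanged, which is what makes this cheaper than adversarial training (no inner attack loop per batch).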