We present VILLA, the first known effort on large-scale adversarial training for vision-and-language (V+L) representation learning. VILLA consists of two training stages: (i) task-agnostic adversarial pre-training, followed by (ii) task-specific adversarial fine-tuning. Instead of adding adversarial perturbations to image pixels and textual tokens, we propose to perform adversarial training in the embedding space of each modality. To enable large-scale training, we adopt the "free" adversarial training strategy, combined with KL-divergence-based regularization to promote higher invariance in the embedding space. We apply VILLA to current best-performing V+L models and achieve new state of the art on a wide range of tasks, including Visual Question Answering, Visual Commonsense Reasoning, Image-Text Retrieval, Referring Expression Comprehension, Visual Entailment, and NLVR2.
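The core idea of adversarial training in the embedding space with KL-based regularization can be sketched as follows. This is a minimal illustrative example, not the authors' implementation: `model` is assumed to be a PyTorch module that accepts embeddings directly, and the hyperparameters `eps` and `alpha` are placeholders.

```python
import torch
import torch.nn.functional as F

def adv_embedding_step(model, embeds, labels, eps=1e-2, alpha=1e-3):
    """One adversarial step on embeddings (illustrative sketch)."""
    # Initialize a small random perturbation inside the epsilon ball.
    delta = torch.zeros_like(embeds).uniform_(-eps, eps).requires_grad_(True)

    clean_logits = model(embeds)            # forward pass on clean embeddings
    adv_logits = model(embeds + delta)      # forward pass on perturbed embeddings

    # Task loss on the perturbed input, plus a KL term that ties the
    # perturbed prediction distribution back to the clean one.
    task_loss = F.cross_entropy(adv_logits, labels)
    kl_loss = F.kl_div(F.log_softmax(adv_logits, dim=-1),
                       F.softmax(clean_logits.detach(), dim=-1),
                       reduction="batchmean")
    loss = task_loss + kl_loss
    loss.backward()  # in "free" training, the same backward pass also
                     # yields the weight gradients for the model update

    # Ascent step on the perturbation, projected back into the epsilon ball.
    with torch.no_grad():
        delta = (delta + alpha * delta.grad.sign()).clamp_(-eps, eps)
    return loss.item(), delta
```

The "free" strategy amortizes cost by reusing one backward pass for both the perturbation update and the model-weight update, which is what makes adversarial training feasible at pre-training scale.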



The ever-increasing computational demand of Deep Learning has propelled research in special-purpose inference accelerators based on emerging non-volatile memory (NVM) technologies. Such NVM crossbars promise fast and energy-efficient in-situ matrix-vector multiplications (MVM), thus alleviating the long-standing von Neumann bottleneck in today's digital hardware. However, the analog nature of computing in these NVM crossbars introduces approximations in the MVM operations. In this paper, we study the impact of these non-idealities on the performance of DNNs under adversarial attacks. The non-ideal behavior interferes with the computation of the exact gradient of the model, which is required for adversarial image generation. In a non-adaptive attack, where the attacker is unaware of the analog hardware, we show that analog computing offers a varying degree of intrinsic robustness, with a peak adversarial accuracy improvement of 35.34%, 22.69%, and 31.70% under white-box PGD ($\epsilon$=1/255, iter=30) for CIFAR-10, CIFAR-100, and ImageNet (top-5), respectively. We also demonstrate "hardware-in-loop" adaptive attacks that circumvent this robustness by utilizing knowledge of the NVM model. To the best of our knowledge, this is the first work exploring the non-idealities of analog computing for adversarial robustness, as of submission to NeurIPS 2020.
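For reference, the white-box PGD attack in the abstract's setting ($\epsilon$=1/255, 30 iterations) can be sketched as below. This is a generic PGD implementation, not the paper's code; `model` and the step size `alpha` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=1/255, alpha=0.5/255, iters=30):
    """Projected gradient descent attack in the L-infinity ball (sketch)."""
    # Random start inside the epsilon ball, clipped to valid pixel range.
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(iters):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            # Ascend the loss, then project back into the epsilon ball
            # around x and the valid [0, 1] pixel range.
            x_adv = x_adv + alpha * grad.sign()
            x_adv = (x + (x_adv - x).clamp(-eps, eps)).clamp(0, 1)
        x_adv = x_adv.detach()
    return x_adv
```

The non-adaptive attack in the paper corresponds to computing `grad` on an ideal digital model while the defender deploys on non-ideal analog hardware; the "hardware-in-loop" adaptive attack instead queries the NVM model itself when forming the gradient, which removes the mismatch the robustness relies on.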