InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining

Multi-modal pretraining for learning high-level multi-modal representations is a further step towards deep learning and artificial intelligence. In this work, we propose a novel model, InterBERT (BERT for Interaction), which is the first model in our series of multi-modal pretraining methods, M6 (MultiModality-to-MultiModality Multitask Mega-transformer). The model has a strong capability for modeling the interaction between the information flows of different modalities. Its single-stream interaction module effectively processes information from multiple modalities, and the two-stream module on top preserves the independence of each modality to avoid performance degradation in single-modal tasks. We pretrain the model with three pretraining tasks: masked segment modeling (MSM), masked region modeling (MRM), and image-text matching (ITM). We then finetune it on a series of vision-and-language downstream tasks. Experimental results demonstrate that InterBERT outperforms a series of strong baselines, including the most recent multi-modal pretraining methods. Our analysis shows that MSM and MRM are effective for pretraining and that our method achieves performance comparable to BERT in single-modal tasks. In addition, we propose a large-scale dataset for multi-modal pretraining in Chinese, and we develop the Chinese InterBERT, the first Chinese multi-modal pretrained model. We pretrain the Chinese InterBERT on our proposed dataset of 3.1M image-text pairs from mobile Taobao, the largest Chinese e-commerce platform. We finetune the model for text-based image retrieval, and we recently deployed the model online for topic-based recommendation.
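To make the idea behind masked segment modeling concrete, the following is a minimal sketch of masking contiguous token spans rather than isolated tokens. It is an illustration only, not the authors' implementation: the function name `mask_segments`, the 15% masking budget, the maximum span length of 3, and the `[MASK]` placeholder are all assumptions that the abstract does not specify.

```python
import random

MASK = "[MASK]"  # assumed placeholder token, not taken from the paper


def mask_segments(tokens, mask_ratio=0.15, max_span=3, seed=None):
    """Mask contiguous spans until about mask_ratio of the tokens are covered.

    Returns (masked_tokens, labels). labels[i] holds the original token at
    masked positions and None elsewhere. All defaults are illustrative.
    """
    rng = random.Random(seed)
    n = len(tokens)
    # Number of positions to mask, clamped to the sequence length.
    budget = min(n, max(1, round(n * mask_ratio))) if n else 0
    masked = list(tokens)
    labels = [None] * n
    covered = 0
    attempts = 0
    while covered < budget and attempts < 10 * n:
        attempts += 1
        # Never pick a span longer than the remaining budget.
        span = min(rng.randint(1, max_span), budget - covered)
        start = rng.randrange(0, n - span + 1)
        # Skip candidate spans that overlap an already masked position.
        if any(labels[i] is not None for i in range(start, start + span)):
            continue
        for i in range(start, start + span):
            labels[i] = tokens[i]
            masked[i] = MASK
        covered += span
    return masked, labels


if __name__ == "__main__":
    sentence = "a red cotton dress with short sleeves and a round collar".split()
    masked, labels = mask_segments(sentence, seed=0)
    print(masked)
    print(labels)
```

A model trained on such inputs would be asked to predict the original tokens at the masked positions. Masked region modeling could be sketched analogously over a sequence of image-region features, again under assumptions of our own.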
