The aim of multimodal neural networks is to combine diverse data sources, referred to as modalities, to achieve better performance than relying on a single modality. However, training multimodal networks is typically hindered by modality overfitting, where the network relies excessively on one of the available modalities. This often yields suboptimal performance, limiting the potential of multimodal learning and leaving only marginal improvements over unimodal models. In this work, we present the Modality-Informed Learning ratE Scheduler (MILES) for training multimodal joint fusion models in a balanced manner. MILES leverages differences in the modality-wise conditional utilization rates observed during training to balance multimodal learning: it dynamically adjusts the learning rate so that the model learns from each modality at a comparable pace, aiming for improved performance in both multimodal and unimodal predictions. We extensively evaluate MILES on four multimodal joint fusion tasks and compare it against seven state-of-the-art baselines. Our results show that MILES outperforms all baselines across every task and fusion method considered in our study, effectively balancing modality usage during training. This yields improved multimodal performance and stronger modality encoders, which can be leveraged when dealing with unimodal samples or missing modalities. Overall, our work highlights the impact of balanced multimodal learning on model performance.
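To make the core mechanism concrete, the following is a minimal illustrative sketch of a modality-informed learning rate adjustment in the spirit of the abstract. It is not the paper's algorithm: the `JointFusion` toy model, the utilization proxy (accuracy drop when one modality's input is zeroed out), and the rescaling rule are all assumptions introduced for exposition, and the published definition of the conditional utilization rate and the MILES schedule may differ.

```python
# Hypothetical sketch: rescale per-encoder learning rates using an assumed
# proxy for modality-wise conditional utilization. Not the paper's method.
import torch
import torch.nn as nn

class JointFusion(nn.Module):
    """Toy two-modality joint fusion model (concatenation fusion)."""
    def __init__(self, d_a=16, d_b=16, hidden=32, n_classes=4):
        super().__init__()
        self.enc_a = nn.Sequential(nn.Linear(d_a, hidden), nn.ReLU())
        self.enc_b = nn.Sequential(nn.Linear(d_b, hidden), nn.ReLU())
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x_a, x_b):
        return self.head(torch.cat([self.enc_a(x_a), self.enc_b(x_b)], dim=-1))

@torch.no_grad()
def utilization(model, x_a, x_b, y):
    """Assumed proxy for each modality's conditional utilization rate:
    the accuracy drop when that modality's input is zeroed out."""
    def acc(a, b):
        return (model(a, b).argmax(dim=-1) == y).float().mean().item()
    full = acc(x_a, x_b)
    u_a = full - acc(torch.zeros_like(x_a), x_b)  # reliance on modality a
    u_b = full - acc(x_a, torch.zeros_like(x_b))  # reliance on modality b
    return u_a, u_b

model = JointFusion()
base_lr, eps = 1e-3, 1e-3
opt = torch.optim.SGD(
    [{"params": model.enc_a.parameters(), "name": "a"},
     {"params": model.enc_b.parameters(), "name": "b"},
     {"params": model.head.parameters(), "name": "head"}],
    lr=base_lr,
)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):  # random tensors stand in for real multimodal batches
    x_a, x_b = torch.randn(64, 16), torch.randn(64, 16)
    y = torch.randint(0, 4, (64,))
    loss = loss_fn(model(x_a, x_b), y)
    opt.zero_grad()
    loss.backward()

    # Illustrative rebalancing rule (not the paper's formula): the encoder of
    # the over-utilized modality gets a smaller LR, the lagging one a larger
    # LR, and both stay at base_lr when utilizations are equal.
    u_a, u_b = (max(u, 0.0) for u in utilization(model, x_a, x_b, y))
    total = u_a + u_b + 2 * eps  # eps keeps scales near 1 when both rates ~ 0
    scale = {"a": 2 * (u_b + eps) / total,
             "b": 2 * (u_a + eps) / total,
             "head": 1.0}
    for group in opt.param_groups:
        group["lr"] = base_lr * scale[group["name"]]
    opt.step()
```

In this sketch, an encoder's learning rate shrinks in proportion to how dominant its modality currently is, which is one simple way to slow learning on the over-used modality while letting the lagging encoder catch up; the actual MILES schedule described in the abstract is driven by conditional utilization rates and may use a substantially different update rule.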