Multimodal Sentiment Analysis (MSA) aims to infer human sentiment by integrating information from multiple modalities such as text, audio, and video. In real-world scenarios, however, missing modalities and noisy signals significantly degrade the robustness and accuracy of existing models. While prior work has made progress on both problems, it typically addresses them in isolation, limiting overall effectiveness in practical settings. To jointly mitigate the challenges posed by missing and noisy modalities, we propose Two-stage Modality Denoising and Complementation (TMDC), a framework comprising two sequential training stages. In the Intra-Modality Denoising Stage, dedicated denoising modules extract denoised modality-specific and modality-shared representations from complete data, reducing the impact of noise and strengthening representational robustness. In the Inter-Modality Complementation Stage, these representations are leveraged to compensate for missing modalities, enriching the available information and further improving robustness. Extensive experiments on MOSI, MOSEI, and IEMOCAP demonstrate that TMDC consistently outperforms existing methods, establishing new state-of-the-art results.
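To make the two-stage pipeline concrete, the sketch below illustrates one possible realization in PyTorch. It is a minimal illustration under our own assumptions, not the authors' implementation: the module names (`DenoiseEncoder`, `SharedEncoder`, `Complementer`), the shared feature dimension, the synthetic Gaussian corruption in Stage 1, and the mean-pooled shared representation used for imputation in Stage 2 are all hypothetical stand-ins for the paper's dedicated denoising and complementation modules.

```python
# Hypothetical sketch of the two-stage TMDC training loop described above.
# All module names and hyperparameters are illustrative assumptions,
# not the authors' code.
import torch
import torch.nn as nn

MODALITIES = ["text", "audio", "video"]
DIM = 64  # assumed common feature dimension for all modalities

class DenoiseEncoder(nn.Module):
    """Modality-specific denoising: maps a noisy feature to a cleaned one."""
    def __init__(self, dim=DIM):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
    def forward(self, x):
        return self.net(x)

class SharedEncoder(nn.Module):
    """Projects any modality into a common (modality-shared) space."""
    def __init__(self, dim=DIM):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
    def forward(self, x):
        return self.proj(x)

class Complementer(nn.Module):
    """Reconstructs a missing modality from the shared representations
    of the modalities that are present."""
    def __init__(self, dim=DIM):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
    def forward(self, shared_mean):
        return self.net(shared_mean)

specific = nn.ModuleDict({m: DenoiseEncoder() for m in MODALITIES})
shared = SharedEncoder()
complement = nn.ModuleDict({m: Complementer() for m in MODALITIES})
head = nn.Linear(DIM * len(MODALITIES), 1)  # sentiment regression head

def stage1_step(batch, clean, opt):
    """Intra-Modality Denoising: train on complete data to recover
    clean features from noise-corrupted inputs."""
    opt.zero_grad()
    loss = 0.0
    for m in MODALITIES:
        noisy = batch[m] + 0.1 * torch.randn_like(batch[m])  # synthetic noise
        loss = loss + nn.functional.mse_loss(specific[m](noisy), clean[m])
    loss.backward()
    opt.step()
    return loss.item()

def stage2_step(batch, present, labels, opt):
    """Inter-Modality Complementation: impute missing modalities from the
    shared space, then predict sentiment from the completed set.
    Assumes at least one modality is present per sample."""
    opt.zero_grad()
    denoised = {m: specific[m](batch[m]) for m in MODALITIES if present[m]}
    shared_mean = torch.stack([shared(h) for h in denoised.values()]).mean(dim=0)
    feats = [denoised[m] if present[m] else complement[m](shared_mean)
             for m in MODALITIES]
    pred = head(torch.cat(feats, dim=-1)).squeeze(-1)
    loss = nn.functional.mse_loss(pred, labels)
    loss.backward()
    opt.step()
    return loss.item()
```

In this sketch, Stage 1 would be run to convergence on complete-modality data before Stage 2 begins, with one optimizer over all parameters passed into each step; the actual TMDC objectives and architectures are specified in the paper itself.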