The emergence of multimodal foundation models has revolutionized learning paradigms by enabling joint understanding across diverse data types. In the context of next-generation wireless networks, integrating sensing and communication modalities presents a unique opportunity to develop generalizable and data-efficient models. In this work, we introduce the contrastive-learning-based Wireless Multimodal Foundation Model (WMFM), a large-scale framework that jointly learns from wireless channel coefficients and visual imagery. The WMFM is pretrained with contrastive learning, a self-supervised technique that aligns the embeddings of camera and channel data without requiring explicit labels. The pretrained encoders are then frozen and employed as feature extractors, while lightweight task-specific heads are fine-tuned for downstream tasks, including user localization and line-of-sight/non-line-of-sight (LoS/nLoS) classification. Extensive experiments on the DeepVerse6G dataset demonstrate that the proposed WMFM achieves a 17% improvement in balanced accuracy for LoS/nLoS classification and a 48.5% reduction in localization error compared to the end-to-end (E2E) benchmark, while reducing training time by up to a factor of 90. Even when trained on as little as 20% of the data, the WMFM-based heads outperform the fully supervised E2E model, underscoring their robustness and data efficiency. The proposed approach establishes a foundation for scalable multimodal learning in Integrated Sensing and Communication (ISAC) systems, paving the way for intelligent and adaptive 6G networks.
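The two-stage recipe described above, contrastive pretraining of the channel and image encoders followed by frozen-encoder fine-tuning of lightweight heads, can be illustrated with the following minimal sketch. The encoder architectures, embedding size, input shapes, hyperparameters, and synthetic batch below are illustrative assumptions, not the authors' implementation; the contrastive objective is a symmetric InfoNCE loss over matching (channel, image) pairs.

```python
# Minimal sketch (not the authors' implementation): CLIP-style contrastive
# alignment of a channel encoder and an image encoder, followed by training a
# lightweight LoS/nLoS head on top of the frozen channel encoder. All shapes,
# architectures, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChannelEncoder(nn.Module):
    """Maps flattened channel coefficients (real/imag stacked) to an embedding."""
    def __init__(self, in_dim: int, emb_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(),
                                 nn.Linear(512, emb_dim))

    def forward(self, x):
        return self.net(x)


class ImageEncoder(nn.Module):
    """Small CNN stand-in for the camera branch."""
    def __init__(self, emb_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.proj = nn.Linear(64, emb_dim)

    def forward(self, x):
        return self.proj(self.conv(x))


def contrastive_loss(z_ch, z_img, temperature=0.07):
    """Symmetric InfoNCE: matching (channel, image) pairs attract, others repel."""
    z_ch, z_img = F.normalize(z_ch, dim=-1), F.normalize(z_img, dim=-1)
    logits = z_ch @ z_img.t() / temperature             # [B, B] similarity matrix
    targets = torch.arange(z_ch.size(0), device=z_ch.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


# Synthetic stand-in batch: 32 (channel, image) pairs plus LoS/nLoS labels.
B = 32
channels = torch.randn(B, 2 * 64 * 32)                  # flattened channel coefficients
images = torch.randn(B, 3, 64, 64)                      # co-located camera frames
labels = torch.randint(0, 2, (B,))                      # 0 = LoS, 1 = nLoS

# --- Stage 1: self-supervised pretraining on unlabeled (channel, image) pairs ---
ch_enc, img_enc = ChannelEncoder(in_dim=2 * 64 * 32), ImageEncoder()
opt = torch.optim.Adam(list(ch_enc.parameters()) + list(img_enc.parameters()), lr=1e-4)
loss = contrastive_loss(ch_enc(channels), img_enc(images))
opt.zero_grad(); loss.backward(); opt.step()

# --- Stage 2: freeze the encoders and fit a lightweight task-specific head ---
for p in ch_enc.parameters():
    p.requires_grad_(False)
head = nn.Linear(256, 2)                                # LoS/nLoS classifier head
head_opt = torch.optim.Adam(head.parameters(), lr=1e-3)
head_loss = F.cross_entropy(head(ch_enc(channels)), labels)
head_opt.zero_grad(); head_loss.backward(); head_opt.step()
```

A user-localization head would follow the same frozen-encoder pattern, with the classifier replaced by a small regression head trained under an MSE loss on position coordinates.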