Multi-modal image fusion (MMIF) enhances the information content of the fused image by combining the unique and common features extracted from different modality sensor images, thereby improving visualization, object detection, and many other downstream tasks. In this work, we introduce an interpretable network for the MMIF task, named FNet, based on an $\ell_0$-regularized multi-modal convolutional sparse coding (MCSC) model. Specifically, to solve the $\ell_0$-regularized CSC problem, we design a learnable $\ell_0$-regularized sparse coding (LZSC) block in a principled manner through deep unfolding. Given source images from different modalities, FNet first separates their unique and common features using the LZSC block, and then combines these features to generate the final fused image. Additionally, we propose an $\ell_0$-regularized MCSC model for the inverse fusion process. Based on this model, we introduce an interpretable inverse fusion network named IFNet, which is used during FNet's training. Extensive experiments show that FNet achieves high-quality fusion results across eight different MMIF datasets. Furthermore, we show that FNet enhances downstream object detection and semantic segmentation on visible-thermal image pairs. We also visualize the intermediate results of FNet, which demonstrate the good interpretability of our network. Code and models are available at: https://github.com/gargi884/FNet-MMIF.
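For intuition, a representative single-image $\ell_0$-regularized CSC objective is sketched below in standard notation; the paper's exact multi-modal formulation may differ, and the symbols $\mathbf{x}$, $\mathbf{d}_k$, $\mathbf{z}_k$, $\lambda$, and $\eta$ are our own notational assumptions rather than quantities taken from the abstract:
\begin{equation*}
\min_{\{\mathbf{z}_k\}} \; \frac{1}{2}\Big\|\mathbf{x} - \sum_{k=1}^{K} \mathbf{d}_k \ast \mathbf{z}_k\Big\|_2^2 + \lambda \sum_{k=1}^{K} \|\mathbf{z}_k\|_0,
\end{equation*}
where $\mathbf{x}$ is the input image, $\mathbf{d}_k$ are convolutional dictionary filters, and $\mathbf{z}_k$ are sparse feature maps. Unfolding iterative hard thresholding (IHT) for this objective yields layer updates of the form
\begin{equation*}
\mathbf{z}_k^{(t+1)} = \mathcal{H}_{\sqrt{2\eta\lambda}}\Big(\mathbf{z}_k^{(t)} + \eta\,\tilde{\mathbf{d}}_k \ast \big(\mathbf{x} - \textstyle\sum_{j} \mathbf{d}_j \ast \mathbf{z}_j^{(t)}\big)\Big),
\end{equation*}
where $\mathcal{H}_{\tau}$ zeroes entries with magnitude below $\tau$ and $\tilde{\mathbf{d}}_k$ denotes the flipped (correlation) filter. In a deep-unfolded block such as LZSC, quantities like the step size $\eta$, the threshold, and the filters would plausibly become learnable per layer, though the precise parameterization is specified in the paper itself.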