MindDiffuser: Controlled Image Reconstruction from Human Brain Activity with Semantic and Structural Diffusion

Reconstructing visual stimuli from functional magnetic resonance imaging (fMRI) measurements is a meaningful and challenging task. Earlier studies produced reconstructions whose structure resembles the original images, such as the outlines and sizes in some natural images, but these reconstructions lack explicit semantic information and are hard to recognize. More recent studies use multimodal pre-trained models with stronger generative capabilities to reconstruct images that are semantically similar to the originals, but structural information in these images, such as position and orientation, is uncontrollable. To address both issues at once, we propose MindDiffuser, a two-stage image reconstruction model built on Stable Diffusion. In Stage 1, the VQ-VAE latent representations and the CLIP text embeddings decoded from fMRI are fed into the image-to-image process of Stable Diffusion, which yields a preliminary image containing semantic and structural information. In Stage 2, we use the low-level CLIP visual features decoded from fMRI as supervisory information and iteratively adjust the two Stage 1 features through backpropagation to align the structural information. Qualitative and quantitative analyses show that the proposed model surpasses the current state-of-the-art models on the Natural Scenes Dataset (NSD). Ablation experiments further indicate that each component of the model contributes to image reconstruction.
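The two stages can be illustrated with a small NumPy toy. This is only a sketch under strong assumptions and not the authors' implementation: Stable Diffusion and CLIP are replaced by a frozen linear map (`features`), the fMRI decoders are assumed to be ridge regressions (the abstract does not say which decoder is used), and all data, sizes, and names (`fit_ridge`, `refine`, `mix`) are hypothetical. What it keeps is the control flow: decode two generator inputs and a target feature vector from fMRI, then adjust both inputs by gradient descent until the generated features match the decoded target.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only to keep the toy small.
N_TRAIN, N_VOXELS, D_LATENT, D_TEXT, D_FEAT = 200, 50, 8, 6, 10

# Frozen linear stand-in for "generate an image, then extract its
# low-level visual features". The real pipeline is non-linear.
A = rng.normal(size=(D_FEAT, D_LATENT))
B = rng.normal(size=(D_FEAT, D_TEXT))

def features(z, c):
    return A @ z + B @ c

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression from fMRI patterns X to features Y."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ Y)

def refine(z, c, target, steps=200):
    """Stage 2 analogue: gradient descent on ||features(z, c) - target||^2."""
    # Step size derived from the largest singular value keeps descent stable.
    lr = 0.5 / np.linalg.norm(np.hstack([A, B]), 2) ** 2
    losses = []
    for _ in range(steps):
        r = features(z, c) - target      # residual in feature space
        losses.append(float(r @ r))
        z = z - lr * 2 * (A.T @ r)       # gradient step on the latent
        c = c - lr * 2 * (B.T @ r)       # gradient step on the text embedding
    return z, c, losses

# Synthetic data: latents Z, text embeddings C, and stimulus features F.
# The noise on F models the mismatch between the toy generator and the
# features of the real stimulus.
Z = rng.normal(size=(N_TRAIN, D_LATENT))
C = rng.normal(size=(N_TRAIN, D_TEXT))
F = Z @ A.T + C @ B.T + 0.5 * rng.normal(size=(N_TRAIN, D_FEAT))

# Synthetic "recordings": fMRI is a noisy linear mixture of Z and C.
mix = rng.normal(size=(D_LATENT + D_TEXT, N_VOXELS))
fmri = np.hstack([Z, C]) @ mix + 0.5 * rng.normal(size=(N_TRAIN, N_VOXELS))

# One decoder per feature type, fitted on all but the last sample.
W_z, W_c, W_f = (fit_ridge(fmri[:-1], Y[:-1]) for Y in (Z, C, F))

# Stage 1 analogue: decode the two generator inputs for a held-out scan,
# plus the target features used as supervision in Stage 2.
x = fmri[-1]
z0, c0, target = x @ W_z, x @ W_c, x @ W_f

# Stage 2 analogue: adjust both inputs toward the decoded target features.
z1, c1, losses = refine(z0, c0, target)
print(f"feature loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

In the real model the gradient would have to flow through the diffusion sampler and the CLIP image encoder, so an automatic-differentiation framework would replace the hand-written gradients used here.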
