Multi-Level Semantic Alignment: An Approach to Radiology Report Generation (Unify, Align and Refine: Multi-Level Semantic Alignment for Radiology Report Generation)

Automatic radiology report generation has attracted enormous research interest due to its practical value in reducing the workload of radiologists. However, simultaneously establishing global correspondences between the image (e.g., a chest X-ray) and its related report, and local alignments between image patches and keywords, remains challenging. To this end, we propose a Unify, Align and then Refine (UAR) approach to learn multi-level cross-modal alignments and introduce three novel modules: Latent Space Unifier (LSU), Cross-modal Representation Aligner (CRA) and Text-to-Image Refiner (TIR). Specifically, LSU unifies multimodal data into discrete tokens, making it flexible to learn common knowledge among modalities with a shared network. The modality-agnostic CRA first learns discriminative features via a set of orthonormal bases and a dual-gate mechanism, and then globally aligns visual and textual representations under a triplet contrastive loss. TIR boosts token-level local alignment by calibrating text-to-image attention with a learnable mask. Additionally, we design a two-stage training procedure so that UAR gradually grasps cross-modal alignments at different levels, which imitates radiologists' workflow: writing sentence by sentence first and then checking word by word. Extensive experiments and analyses on the IU-Xray and MIMIC-CXR benchmark datasets demonstrate the superiority of UAR over a variety of state-of-the-art methods.
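
To make the global alignment step more concrete, the following is a minimal sketch of a triplet loss over paired image and report features. The abstract does not give the exact form of the triplet contrastive loss, so everything here is an assumption made for illustration: the cosine similarity, the hinge formulation, the margin of 0.2, the use of every mismatched pair in the batch as a negative, and the function names `cosine`, `triplet_loss` and `batch_triplet_loss`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss (assumed form, not taken from the paper).

    The loss is zero once the anchor is more similar to the positive
    than to the negative by at least `margin`.
    """
    return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))


def batch_triplet_loss(image_feats, report_feats, margin=0.2):
    """Average triplet loss in both directions over a batch.

    image_feats[i] and report_feats[i] are assumed to describe the same
    study; every other item in the batch serves as a negative.
    """
    total, count = 0.0, 0
    n = len(image_feats)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Image as anchor, mismatched report as negative.
            total += triplet_loss(image_feats[i], report_feats[i], report_feats[j], margin)
            # Report as anchor, mismatched image as negative.
            total += triplet_loss(report_feats[i], image_feats[i], image_feats[j], margin)
            count += 2
    return total / count


images = [[1.0, 0.0], [0.0, 1.0]]
aligned_reports = [[1.0, 0.0], [0.0, 1.0]]
swapped_reports = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = batch_triplet_loss(images, aligned_reports)
loss_swapped = batch_triplet_loss(images, swapped_reports)
```

In this toy batch the loss vanishes when each image feature coincides with its own report feature, and it is large when the reports are swapped, which is the behaviour a global alignment objective of this kind is meant to reward.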

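Translation: Automated radiology report generation has attracted strong research interest because it can ease radiologists' workload. However, it remains challenging to establish, at the same time, global correspondences between an image (for example, a chest X-ray) and its report and local alignments between image patches and keywords. To address this, we propose the Unify, Align and then Refine (UAR) approach, which learns multi-level cross-modal alignments through three new modules: the Latent Space Unifier (LSU), the Cross-modal Representation Aligner (CRA) and the Text-to-Image Refiner (TIR). Specifically, LSU converts multimodal data into discrete tokens so that a shared network can learn knowledge common to the modalities. CRA first learns discriminative features through a set of orthonormal bases and a dual-gate mechanism, and then globally aligns image and text representations under a triplet contrastive loss. TIR strengthens token-level local alignment by calibrating text-to-image attention with a learnable mask. In addition, a two-stage training procedure lets UAR acquire cross-modal alignments at different levels step by step, mimicking a radiologist's workflow: write sentence by sentence first, then check word by word. Extensive experiments and analyses on the IU-Xray and MIMIC-CXR benchmark datasets show that UAR outperforms a range of state-of-the-art methods.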
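
The token-level refinement can be illustrated in a similar spirit. The sketch below shows one possible way to calibrate text-to-image attention with a learnable mask: each mask logit is squashed to a gate in (0, 1), the softmax attention weights are multiplied by the gates, and each row is renormalised. The abstract does not specify how TIR applies its mask, so the gating scheme, the names `softmax`, `sigmoid` and `calibrated_attention`, and the example values are all hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def sigmoid(x):
    """Logistic function mapping a real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def calibrated_attention(scores, mask_logits):
    """Rescale text-to-image attention with a gate per (token, patch) pair.

    scores[t][p]      raw attention score of text token t for image patch p
    mask_logits[t][p] learnable parameters (assumed); sigmoid turns them
                      into gates that damp or keep each attention weight
    Returns attention rows that each sum to 1.
    """
    calibrated = []
    for score_row, logit_row in zip(scores, mask_logits):
        weights = softmax(score_row)
        gated = [w * sigmoid(g) for w, g in zip(weights, logit_row)]
        total = sum(gated)
        calibrated.append([g / total for g in gated])
    return calibrated


scores = [[1.0, 2.0, 3.0]]  # one text token attending to three patches

# A neutral mask (all logits zero) leaves the attention unchanged.
neutral = calibrated_attention(scores, [[0.0, 0.0, 0.0]])

# A strongly negative logit suppresses the third patch almost entirely.
suppressed = calibrated_attention(scores, [[0.0, 0.0, -50.0]])
```

With all mask logits at zero every gate equals 0.5, so the renormalised attention is identical to the plain softmax. In a trained model the logits would be learned jointly with the rest of the network rather than set by hand as they are here.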