METransformer: Radiology Report Generation by Transformer with Multiple Learnable Expert Tokens

In clinical scenarios, multi-specialist consultation can significantly benefit diagnosis, especially for intricate cases. This inspires us to explore a "multi-expert joint diagnosis" mechanism to upgrade the "single expert" framework commonly seen in the current literature. To this end, we propose METransformer, a method that realizes this idea with a transformer-based backbone. The key design of our method is the introduction of multiple learnable "expert" tokens into both the transformer encoder and decoder. In the encoder, each expert token interacts with both the vision tokens and the other expert tokens to learn to attend to different image regions for image representation. These expert tokens are encouraged to capture complementary information by an orthogonal loss that minimizes their overlap. In the decoder, each attended expert token guides the cross-attention between input words and visual tokens, thus influencing the generated report. A metrics-based expert voting strategy is further developed to generate the final report. With the multi-expert concept, our model enjoys the merits of an ensemble-based approach, but in a manner that is computationally more efficient and supports more sophisticated interactions among experts. Experimental results demonstrate the promising performance of our proposed model on two widely used benchmarks. Last but not least, the framework-level innovation makes our work ready to incorporate advances in existing "single-expert" models to further improve its performance.
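
To make two of these ideas more concrete, here is a minimal pure-Python sketch of what an orthogonal loss over expert tokens and a metrics-based vote over candidate reports might look like. The abstract does not give either formula, so everything here is an assumption for illustration: the loss is taken to be the mean squared deviation of the experts' cosine-similarity matrix from the identity, the vote scores each expert's report against the other experts' reports, and `unigram_f1` is a simple stand-in for whatever text metric the authors actually use. The function names `orthogonal_loss`, `unigram_f1`, and `vote` are hypothetical.

```python
import math
from collections import Counter


def l2_normalize(vec):
    """Scale a vector to unit length (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def orthogonal_loss(expert_tokens):
    """Assumed form: mean squared gap between the cosine-similarity
    matrix of the expert tokens and the identity matrix.
    The loss is 0 when all experts are mutually orthogonal."""
    normed = [l2_normalize(v) for v in expert_tokens]
    m = len(normed)
    total = 0.0
    for i in range(m):
        for j in range(m):
            sim = sum(a * b for a, b in zip(normed[i], normed[j]))
            target = 1.0 if i == j else 0.0
            total += (sim - target) ** 2
    return total / (m * m)


def unigram_f1(candidate, reference):
    """Stand-in text metric (an assumption): F1 over whitespace tokens."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def vote(reports, metric=unigram_f1):
    """Score each expert's report by its average agreement with the
    other experts' reports and return (index of the winner, all scores)."""
    if len(reports) < 2:
        raise ValueError("voting needs at least two candidate reports")
    scores = []
    for i, cand in enumerate(reports):
        others = [r for j, r in enumerate(reports) if j != i]
        scores.append(sum(metric(cand, o) for o in others) / len(others))
    best = max(range(len(reports)), key=scores.__getitem__)
    return best, scores


if __name__ == "__main__":
    # Two orthogonal experts versus two identical experts.
    print(orthogonal_loss([[1.0, 0.0], [0.0, 1.0]]))
    print(orthogonal_loss([[1.0, 0.0], [1.0, 0.0]]))
    # Three toy candidate reports, one per expert.
    candidates = [
        "heart size is normal",
        "heart size is normal no effusion",
        "heart size is enlarged",
    ]
    print(vote(candidates))
```

In this sketch, identical expert tokens are penalized while mutually orthogonal ones incur no loss, which matches the stated goal of minimizing overlap between experts. The vote favors the report that agrees most with the others, which is one plausible reading of "metrics-based expert voting"; swapping in a different metric only requires passing another function as `metric`.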
