Though Large Vision-Language Models (LVLMs) are being actively explored in medicine, their ability to conduct complex real-world telemedicine consultations combining accurate diagnosis with professional dialogue remains underexplored. This paper presents 3MDBench (Medical Multimodal Multi-agent Dialogue Benchmark), an open-source framework for simulating and evaluating LVLM-driven telemedical consultations. 3MDBench simulates patient variability through a temperament-based Patient Agent and evaluates diagnostic accuracy and dialogue quality via an Assessor Agent. It includes 2,996 cases across 34 diagnoses drawn from real-world telemedicine interactions, combining textual and image-based data. The experimental study compares diagnostic strategies for widely used open- and closed-source LVLMs. We demonstrate that multimodal dialogue with internal reasoning improves the F1 score by 6.5% over non-dialogue settings, highlighting the importance of context-aware, information-seeking questioning. Moreover, injecting predictions from a diagnostic convolutional neural network into the LVLM's context boosts the F1 score by up to 20%. Source code is available at https://github.com/univanxx/3mdbench.
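To make the setup concrete, the sketch below outlines one plausible shape of such a consultation loop: a temperament-conditioned Patient Agent, a doctor LVLM asking information-seeking questions, and a diagnostic CNN whose predictions are injected into the LVLM's context before the final diagnosis. This is a minimal illustration under assumed interfaces; all identifiers (Case, query_lvlm, patient_reply, cnn_predict, consult) are hypothetical and may not match the actual API in the repository at https://github.com/univanxx/3mdbench.

```python
# Hypothetical sketch of a 3MDBench-style consultation loop.
# All names below are illustrative, not the framework's real API.
import random
from dataclasses import dataclass

TEMPERAMENTS = ["sanguine", "choleric", "melancholic", "phlegmatic"]

@dataclass
class Case:
    complaint: str       # patient's opening textual complaint
    image_path: str      # accompanying photo (e.g., of a skin lesion)
    true_diagnosis: str  # ground-truth label, one of the 34 diagnoses

def query_lvlm(prompt: str, image_path: str | None = None) -> str:
    """Placeholder for a call to an open- or closed-source LVLM."""
    return "Does the affected area itch more at night?"

def patient_reply(question: str, case: Case, temperament: str) -> str:
    """Placeholder Patient Agent: answers are grounded in the case
    facts and colored by the sampled temperament."""
    return f"({temperament}) Yes, mostly in the evening."

def cnn_predict(image_path: str) -> list[tuple[str, float]]:
    """Placeholder diagnostic CNN returning (label, probability) pairs."""
    return [("atopic dermatitis", 0.62), ("psoriasis", 0.21)]

def consult(case: Case, max_turns: int = 4) -> str:
    temperament = random.choice(TEMPERAMENTS)  # patient variability
    # Injecting the CNN's predictions into the LVLM's context is the
    # step the abstract credits with up to a 20% F1 improvement.
    hints = ", ".join(f"{d} ({p:.0%})" for d, p in cnn_predict(case.image_path))
    history = [f"CNN prior over diagnoses: {hints}",
               f"Patient: {case.complaint}"]
    for _ in range(max_turns):
        # Doctor turn: a context-aware, information-seeking question
        # (internal reasoning would also be elicited in the prompt).
        question = query_lvlm("\n".join(history), case.image_path)
        history.append(f"Doctor: {question}")
        history.append(f"Patient: {patient_reply(question, case, temperament)}")
    # Final turn: commit to one diagnosis for scoring by the Assessor Agent.
    return query_lvlm("\n".join(history) + "\nState the final diagnosis.")
```

In the paper's actual pipeline, the Assessor Agent would then score both the returned diagnosis (against the ground truth) and the quality of the dialogue transcript accumulated in `history`.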