The applicability of current lesion segmentation models for chest X-rays (CXRs) has been limited both by the small number of target labels they support and by their reliance on long, detailed, expert-level text inputs, creating a barrier to practical use. To address these limitations, we introduce a new paradigm: instruction-guided lesion segmentation (ILS), which is designed to segment diverse lesion types based on simple, user-friendly instructions. Under this paradigm, we construct MIMIC-ILS, the first large-scale instruction-answer dataset for CXR lesion segmentation, using our fully automated multimodal pipeline that generates annotations from chest X-ray images and their corresponding reports. MIMIC-ILS contains 1.1M instruction-answer pairs derived from 192K images and 91K unique segmentation masks, covering seven major lesion types. To empirically demonstrate its utility, we introduce ROSALIA, a vision-language model fine-tuned on MIMIC-ILS. ROSALIA can segment diverse lesions and provide textual explanations in response to user instructions. The model achieves high segmentation and textual accuracy on our newly proposed task, highlighting the effectiveness of our pipeline and the value of MIMIC-ILS as a foundational resource for pixel-level CXR lesion grounding.