With Transformers achieving outstanding performance on individual remote sensing (RS) tasks, a unified model that excels across multiple tasks through multi-task learning (MTL) is within reach. Compared to single-task approaches, MTL methods offer improved generalization, enhanced scalability, and greater practical applicability. Recently, vision-language models (VLMs) have achieved promising results in RS image understanding, visual grounding, and ultra-high-resolution (UHR) image reasoning, and their unified text-based interface shows significant potential for MTL. In this work, we present RSCoVLM, a simple yet flexible VLM baseline for RS MTL. First, we build a data curation engine that covers data acquisition, offline processing and integration, and online loading and weighting. This engine effectively handles the complex RS data environment and generates flexible vision-language conversations. Furthermore, we propose a unified dynamic-resolution strategy to accommodate the diverse image scales inherent in RS imagery. For UHR images, we introduce the Zoom-in Chain mechanism together with its corresponding dataset, LRS-VQA-Zoom. These strategies are flexible and effectively mitigate the computational burden. Additionally, we substantially enhance the model's object detection capability and propose a novel evaluation protocol that ensures fair comparison between VLMs and conventional detection models. Extensive experiments demonstrate that RSCoVLM achieves state-of-the-art performance across diverse tasks, outperforming existing RS VLMs and even rivaling specialized expert models. All training and evaluation tools, model weights, and datasets are fully open-sourced to support reproducibility. We expect this baseline to promote further progress toward general-purpose RS models.
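To make the data engine's online loading and weighting step concrete, the following is a minimal sketch of per-dataset weighted sampling, assuming a PyTorch data pipeline; the `build_weighted_loader` helper, the batch size, and the weighting scheme are illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch of online dataset weighting for multi-task training,
# assuming PyTorch; the real data engine may weight and mix differently.
import torch
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler

def build_weighted_loader(datasets, weights, batch_size=8):
    """Mix several task datasets, sampling each with a dataset-level weight."""
    mixed = ConcatDataset(datasets)
    # Expand each dataset-level weight into per-sample weights so that a
    # dataset's total sampling mass equals its assigned weight.
    per_sample = torch.cat([
        torch.full((len(d),), w / len(d))
        for d, w in zip(datasets, weights)
    ])
    sampler = WeightedRandomSampler(per_sample, num_samples=len(mixed),
                                    replacement=True)
    return DataLoader(mixed, batch_size=batch_size, sampler=sampler)
```

In such a setup, re-balancing tasks online reduces to changing the weight vector, without re-shuffling or re-exporting the offline-processed data.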
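The Zoom-in Chain mechanism can likewise be pictured as an iterative coarse-to-fine inference loop over a UHR image. The sketch below assumes a hypothetical `vlm` interface whose `step` call returns either a final answer or a normalized box to zoom into; the mechanism actually trained on LRS-VQA-Zoom may differ in detail.

```python
# A minimal sketch of a Zoom-in Chain style inference loop; `vlm.step`
# and `vlm.answer` are hypothetical calls, not the paper's API.
from PIL import Image

MAX_STEPS = 3       # assumed cap on zoom iterations
COARSE_SIDE = 1024  # assumed side length of each downsampled view

def zoom_in_chain(vlm, image_path: str, question: str) -> str:
    """Answer a question about a UHR image by iteratively zooming into
    model-proposed regions of interest instead of encoding all pixels."""
    image = Image.open(image_path)
    view = image.copy()
    view.thumbnail((COARSE_SIDE, COARSE_SIDE))  # coarse global view first

    for _ in range(MAX_STEPS):
        # Hypothetically returns (answer, None) when confident, or
        # (None, box) with a normalized (x0, y0, x1, y1) region to inspect.
        answer, box = vlm.step(view, question)
        if answer is not None:
            return answer
        # Map the normalized box back to full-resolution coordinates and
        # crop there, so each step sees finer detail at bounded cost.
        w, h = image.size
        crop = image.crop((int(box[0] * w), int(box[1] * h),
                           int(box[2] * w), int(box[3] * h)))
        crop.thumbnail((COARSE_SIDE, COARSE_SIDE))
        view = crop

    # Fall back to answering from the last view if no answer was emitted.
    return vlm.answer(view, question)
```

Because every view is bounded by `COARSE_SIDE`, the token count per step stays fixed regardless of the original image resolution, which is the computational saving the abstract refers to.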