Remote sensing imagery presents vast, inherently unstructured spatial data, and interpreting it requires reasoning about complex user intents and contextual relationships, not just simple recognition. In this paper, we aim to construct an Earth observation workflow that handles complex queries by reasoning about spatial context and user intent. As a reasoning workflow, it should autonomously explore and construct its own inference paths rather than being confined to predefined ground-truth sequences. Ideally, its architecture should be unified yet general, able to perform diverse reasoning tasks with a single model and without additional fine-tuning. Existing remote sensing approaches rely on supervised fine-tuning paradigms and task-specific heads, which limits both autonomous reasoning and unified generalization. To this end, we propose RemoteReasoner, a unified workflow for geospatial reasoning. RemoteReasoner combines a multi-modal large language model (MLLM), which interprets user instructions and localizes targets, with task transformation strategies that support multi-granularity tasks at the object, region, and pixel levels. In contrast to existing methods, our framework is trained with reinforcement learning (RL) to endow the MLLM with sufficient reasoning autonomy. At the inference stage, our transformation strategies enable diverse task output formats without requiring task-specific decoders or further fine-tuning. Experiments demonstrate that RemoteReasoner achieves state-of-the-art (SOTA) performance across multi-granularity reasoning tasks. Furthermore, it retains the MLLM's inherent generalization capability, demonstrating robust performance on unseen tasks and out-of-distribution categories.
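To make the idea of task transformation concrete, the following minimal sketch shows how a single localization result could be re-expressed at a different granularity without any learned decoder. It is an illustration under stated assumptions, not the strategies used in RemoteReasoner, which the abstract does not specify: the inclusive pixel box format and the helpers `box_to_mask`, `mask_to_box`, and `mask_area` are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical formats, assumed for illustration only.
Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), inclusive pixel coordinates
Mask = List[List[int]]           # binary mask indexed as mask[y][x]


def box_to_mask(box: Box, height: int, width: int) -> Mask:
    """Re-express a region-level box as a coarse pixel-level mask (filled rectangle)."""
    x0, y0, x1, y1 = box
    return [
        [1 if (x0 <= x <= x1 and y0 <= y <= y1) else 0 for x in range(width)]
        for y in range(height)
    ]


def mask_to_box(mask: Mask) -> Box:
    """Re-express a pixel-level mask as the tightest enclosing region-level box."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, value in enumerate(row) if value]
    if not ys:
        raise ValueError("mask contains no foreground pixels")
    return (min(xs), min(ys), max(xs), max(ys))


def mask_area(mask: Mask) -> int:
    """Count foreground pixels, a simple scalar summary of the mask."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    box = (1, 1, 3, 2)  # assumed localization output on a 5x4 image
    mask = box_to_mask(box, height=4, width=5)
    print(mask_to_box(mask), mask_area(mask))
```

The conversions are purely geometric, which is the point of the illustration: one localized output can serve several output formats at inference time with no additional training.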