The remote sensing community has recently seen the emergence of methods based on Large Vision and Language Models (LVLMs) that can address multiple tasks at the intersection of computer vision and natural language processing. To fully exploit the potential of such models, significant effort has been devoted to collecting large amounts of training data covering multiple remote sensing-specific tasks, such as image captioning or visual question answering. However, the cost of training and using LVLMs is high due to their large number of parameters. While multiple parameter-efficient adaptation techniques have been explored, the computational costs of training and inference with these models can remain prohibitive for most institutions. In this work, we explore the use of encoder-only architectures and propose a model that effectively addresses multi-task learning while remaining compact in terms of parameter count. In particular, our model tackles a combination of tasks not typically addressed within a unified model: the generation of text from remote sensing images and cross-modal retrieval. The results of our GeoMELT model (named after Multi-task Efficient Learning Transformer) on established benchmarks confirm the efficacy and efficiency of the proposed approach.