Humans have an innate ability to sense their surroundings: they can extract a spatial representation from egocentric perception and form an allocentric semantic map via spatial transformation and memory updating. However, endowing mobile agents with such spatial sensing ability remains a challenge, due to two difficulties: (1) previous convolutional models are limited by their local receptive fields and thus struggle to capture holistic long-range dependencies during observation; (2) the excessive computational budget required for success often forces the mapping pipeline to be split into stages, rendering the entire mapping process inefficient. To address these issues, we propose an end-to-end one-stage Transformer-based framework for Mapping, termed Trans4Map. Our egocentric-to-allocentric mapping process includes three steps: (1) an efficient transformer extracts contextual features from a batch of egocentric images; (2) the proposed Bidirectional Allocentric Memory (BAM) module projects egocentric features into the allocentric memory; (3) a map decoder parses the accumulated memory and predicts the top-down semantic segmentation map. In contrast to prior pipelines, Trans4Map achieves state-of-the-art results, reducing the number of parameters by 67.2% while gaining +3.25% in mIoU and +4.09% in mBF1 on the Matterport3D dataset. Code is available at: https://github.com/jamycheung/Trans4Map.
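To make the three-step pipeline concrete, below is a minimal, runnable sketch of the egocentric-to-allocentric mapping flow. All module names, shapes, and the simplified memory update are illustrative assumptions for exposition (the encoder is a placeholder convolution rather than the efficient transformer, and the projection stands in for the geometric BAM projection); the actual implementation is in the repository linked above.

```python
# Conceptual sketch of the encode -> accumulate-memory -> decode pipeline.
# Names, shapes, and the simplified memory update are assumptions, not the
# released Trans4Map code (see https://github.com/jamycheung/Trans4Map).
import torch
import torch.nn as nn
import torch.nn.functional as F


class Trans4MapSketch(nn.Module):
    def __init__(self, feat_dim=64, num_classes=21, map_size=250):
        super().__init__()
        # (1) Stand-in for the efficient transformer encoder; a strided
        #     convolution keeps this sketch self-contained and runnable.
        self.encoder = nn.Conv2d(3, feat_dim, kernel_size=4, stride=4)
        # (3) Map decoder: allocentric memory features -> per-cell class logits.
        self.decoder = nn.Conv2d(feat_dim, num_classes, kernel_size=1)
        self.map_size = map_size

    def project_to_memory(self, feats, memory):
        # (2) Placeholder for the Bidirectional Allocentric Memory (BAM) step.
        #     The real module projects egocentric features into allocentric grid
        #     cells via camera geometry; here we simply resize the feature map
        #     and fuse it into the memory with an element-wise max.
        top_down = F.interpolate(feats, size=(self.map_size, self.map_size),
                                 mode="bilinear", align_corners=False)
        return torch.maximum(memory, top_down)

    def forward(self, frames):
        # frames: (T, 3, H, W) sequence of egocentric observations.
        memory = frames.new_zeros(1, self.decoder.in_channels,
                                  self.map_size, self.map_size)
        for frame in frames:
            feats = self.encoder(frame.unsqueeze(0))        # step (1)
            memory = self.project_to_memory(feats, memory)  # step (2)
        return self.decoder(memory)                         # step (3): top-down logits


# Usage: a 5-frame egocentric sequence -> (1, num_classes, 250, 250) logits.
logits = Trans4MapSketch()(torch.randn(5, 3, 256, 256))
```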