Sparse4D: Multi-view 3D Object Detection with Sparse Spatial-Temporal Fusion

Bird's-eye-view (BEV) based methods have recently made great progress on the multi-view 3D detection task. Compared with BEV-based methods, sparse methods lag behind in performance but still have many non-negligible merits. To push sparse 3D detection further, in this work we introduce a novel method, named Sparse4D, which iteratively refines anchor boxes by sparsely sampling and fusing spatial-temporal features. (1) Sparse 4D Sampling: for each 3D anchor, we assign multiple 4D keypoints, which are then projected onto multi-view/scale/timestamp image features to sample the corresponding features; (2) Hierarchy Feature Fusion: we hierarchically fuse the sampled features across different views/scales, different timestamps, and different keypoints to generate a high-quality instance feature. In this way, Sparse4D can achieve 3D detection efficiently and effectively without relying on dense view transformation or global attention, and is friendlier to deployment on edge devices. Furthermore, we introduce an instance-level depth reweight module to alleviate the ill-posed issue in 3D-to-2D projection. In experiments, our method outperforms all sparse-based methods and most BEV-based methods on the detection task of the nuScenes dataset.
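The two steps named in the abstract (project keypoints into the images, sample features, then fuse them) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the 3x4 projection matrices, the fixed keypoint offsets, and the plain averaging used for fusion are all assumptions made for illustration. The abstract describes hierarchical fusion and temporal keypoints, which are simplified away here.

```python
import numpy as np

def project_points(points_3d, proj):
    """Project (N, 3) points with a 3x4 camera matrix (hypothetical input).

    Returns (N, 2) pixel coordinates and (N,) depths.
    """
    homo = np.concatenate([points_3d, np.ones((len(points_3d), 1))], axis=1)
    cam = homo @ proj.T                      # (N, 3)
    depth = cam[:, 2]
    uv = cam[:, :2] / np.clip(depth[:, None], 1e-6, None)
    return uv, depth

def bilinear_sample(feat, uv):
    """Bilinearly sample an (H, W, C) feature map at (N, 2) pixels (u=col, v=row)."""
    h, w, _ = feat.shape
    u = np.clip(uv[:, 0], 0, w - 1)
    v = np.clip(uv[:, 1], 0, h - 1)
    u0 = np.floor(u).astype(int)
    v0 = np.floor(v).astype(int)
    u1 = np.minimum(u0 + 1, w - 1)
    v1 = np.minimum(v0 + 1, h - 1)
    du = (u - u0)[:, None]
    dv = (v - v0)[:, None]
    top = feat[v0, u0] * (1 - du) + feat[v0, u1] * du
    bottom = feat[v1, u0] * (1 - du) + feat[v1, u1] * du
    return top * (1 - dv) + bottom * dv      # (N, C)

def sample_and_fuse(anchor_center, keypoint_offsets, feats, projs):
    """Sample features for one anchor and fuse them into one instance feature.

    Assumption: fusion is a plain mean over views, then over keypoints.
    Each (feat, proj) pair may stand for one view, scale, or timestamp.
    """
    keypoints = anchor_center + keypoint_offsets          # (K, 3)
    per_view = []
    for feat, proj in zip(feats, projs):
        uv, depth = project_points(keypoints, proj)
        h, w, _ = feat.shape
        # Zero out keypoints that fall behind the camera or outside the image.
        valid = ((depth > 0)
                 & (uv[:, 0] >= 0) & (uv[:, 0] <= w - 1)
                 & (uv[:, 1] >= 0) & (uv[:, 1] <= h - 1))
        per_view.append(bilinear_sample(feat, uv) * valid[:, None])
    stacked = np.stack(per_view)                          # (V, K, C)
    return stacked.mean(axis=0).mean(axis=0)              # (C,)
```

Only K keypoints per anchor are read from each feature map, so the cost scales with the number of anchors and keypoints rather than with a dense BEV grid. That is the efficiency argument the abstract makes against dense view transformation.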
