Camera and radar sensors have significant advantages in cost, reliability, and maintenance compared to LiDAR. Existing fusion methods often fuse the outputs of single modalities at the result level, a strategy known as late fusion. While this benefits from off-the-shelf single-sensor detection algorithms, late fusion cannot fully exploit the complementary properties of the sensors, and thus yields limited performance despite the large potential of camera-radar fusion. Here we propose a novel proposal-level early fusion approach that effectively exploits both the spatial and contextual properties of camera and radar for 3D object detection. Our fusion framework first associates image proposals with radar points in the polar coordinate system to efficiently handle the discrepancy between the two sensors' coordinate systems and spatial properties. Using this as a first stage, consecutive cross-attention-based feature fusion layers then adaptively exchange spatio-contextual information between camera and radar, leading to a robust and attentive fusion. Our camera-radar fusion approach achieves state-of-the-art results of 41.1% mAP and 52.3% NDS on the nuScenes test set, 8.7 and 10.8 points higher than the camera-only baseline, respectively, while remaining competitive with LiDAR-based methods.
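To make the fusion stage concrete, below is a minimal, hypothetical sketch of one cross-attention feature fusion layer in PyTorch, in which camera proposal features attend to their associated radar point features and vice versa. All names, dimensions, and the layer structure (multi-head attention with residual connections and layer normalization) are illustrative assumptions, not the paper's actual implementation; the polar-coordinate proposal-to-point association is assumed to have been performed upstream.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Hypothetical sketch: exchange spatio-contextual information between
    modalities via bidirectional cross-attention (not the authors' code)."""
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cam_from_radar = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.radar_from_cam = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cam_norm = nn.LayerNorm(dim)
        self.radar_norm = nn.LayerNorm(dim)

    def forward(self, cam_feat: torch.Tensor, radar_feat: torch.Tensor):
        # cam_feat:   (B, N_proposals, dim) image proposal features
        # radar_feat: (B, N_points, dim)    features of radar points already
        #                                   associated with proposals in polar space
        cam_upd, _ = self.cam_from_radar(cam_feat, radar_feat, radar_feat)
        radar_upd, _ = self.radar_from_cam(radar_feat, cam_feat, cam_feat)
        cam_feat = self.cam_norm(cam_feat + cam_upd)        # residual + norm
        radar_feat = self.radar_norm(radar_feat + radar_upd)
        return cam_feat, radar_feat

# Stacking a few such layers gives the "consecutive" fusion stages described above.
fusion = nn.ModuleList([CrossAttentionFusion() for _ in range(3)])
cam = torch.randn(2, 100, 256)   # toy batch: 100 image proposals
radar = torch.randn(2, 64, 256)  # toy batch: 64 associated radar points
for layer in fusion:
    cam, radar = layer(cam, radar)
```

Stacking the layers lets each modality refine the other iteratively, which is one plausible reading of the adaptive spatio-contextual exchange described in the abstract.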