Image-based 3D object detection aims to identify and localize objects in 3D space using only RGB images, eliminating the need for the expensive depth sensors required by point cloud-based methods. Existing image-based approaches face two critical challenges: methods achieving high accuracy typically require dense 3D supervision, while those operating without such supervision struggle to extract accurate geometry from images alone. In this paper, we present GVSynergy-Det, a novel framework that enhances 3D detection through synergistic Gaussian-Voxel representation learning. Our key insight is that continuous Gaussian and discrete voxel representations capture complementary geometric information: Gaussians excel at modeling fine-grained surface details, while voxels provide structured spatial context. We introduce a dual-representation architecture that (1) adapts generalizable Gaussian Splatting to extract complementary geometric features for detection tasks, and (2) introduces a cross-representation enhancement mechanism that enriches voxel features with geometric details from Gaussian fields. Unlike previous methods that either rely on time-consuming per-scene optimization or use Gaussian representations solely for depth regularization, our synergistic strategy directly leverages features from both representations through learnable integration, enabling more accurate object localization. Extensive experiments demonstrate that GVSynergy-Det achieves state-of-the-art results on challenging indoor benchmarks, significantly outperforming existing methods on both ScanNetV2 and ARKitScenes, all without requiring any depth or dense 3D geometry supervision (e.g., point clouds or TSDF).
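To make the cross-representation enhancement idea concrete, the following is a minimal NumPy sketch of one plausible realization: per-Gaussian features are splatted into the voxel grid containing each Gaussian mean, then combined with the voxel features through a learnable sigmoid gate. All names and the nearest-cell splatting rule are hypothetical simplifications for illustration, not the paper's actual architecture (which operates on learned feature fields inside a detection network).

```python
import numpy as np

def splat_gaussians_to_voxels(means, feats, grid_shape, voxel_size):
    """Accumulate per-Gaussian features into the voxel cell containing each
    Gaussian mean (nearest-cell splatting; a hypothetical simplification)."""
    C = feats.shape[1]
    grid = np.zeros((*grid_shape, C))
    idx = np.floor(means / voxel_size).astype(int)
    idx = np.clip(idx, 0, np.array(grid_shape) - 1)  # keep indices in bounds
    for (i, j, k), f in zip(idx, feats):
        grid[i, j, k] += f
    return grid

def gated_fusion(voxel_feats, gaussian_feats, w_gate, b_gate):
    """Learnable integration: enrich voxel features with Gaussian-derived
    features via a channel-wise sigmoid gate. `w_gate`/`b_gate` stand in for
    learned parameters; shapes are (N, C) features and (C,) gate weights."""
    gate = 1.0 / (1.0 + np.exp(-(gaussian_feats * w_gate + b_gate)))
    return voxel_feats + gate * gaussian_feats
```

With zero-initialized gate parameters the gate outputs 0.5 everywhere, so the fused feature starts as the voxel feature plus half the Gaussian feature; training would then adjust how much Gaussian detail flows into each channel.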