In recent years, sparse voxel-based methods have become the state of the art for 3D semantic segmentation of indoor scenes, thanks to powerful 3D CNNs. Nevertheless, being oblivious to the underlying geometry, voxel-based methods produce ambiguous features on spatially close objects and, lacking geodesic information, struggle to handle complex and irregular geometries. In view of this, we present Voxel-Mesh Network (VMNet), a novel 3D deep architecture that operates on both voxel and mesh representations, leveraging Euclidean as well as geodesic information. Intuitively, the Euclidean information extracted from voxels offers contextual cues that capture interactions between nearby objects, while the geodesic information extracted from meshes helps separate objects that are spatially close but have disconnected surfaces. To incorporate information from the two domains, we design an intra-domain attentive module for effective feature aggregation and an inter-domain attentive module for adaptive feature fusion. Experimental results validate the effectiveness of VMNet: on the challenging ScanNet dataset for large-scale segmentation of indoor scenes, it outperforms the state-of-the-art SparseConvNet and MinkowskiNet (74.6% vs 72.5% and 73.6% in mIoU) with a simpler network structure (17M vs 30M and 38M parameters). Code release: https://github.com/hzykent/VMNet
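The intuition about Euclidean versus geodesic information can be illustrated with a toy example. The sketch below is not part of VMNet and does not reproduce its modules. It assumes a hypothetical four-vertex scene made of two disconnected surface patches and approximates geodesic distance by the shortest path along mesh edges (Dijkstra's algorithm). The names `euclidean`, `geodesic`, `vertices`, and `edges` are illustrative only.

```python
import heapq
import math


def euclidean(p, q):
    """Straight-line distance between two 3D points."""
    return math.dist(p, q)


def geodesic(vertices, edges, source, target):
    """Shortest path length along mesh edges (Dijkstra).

    Returns infinity when the two vertices lie on disconnected surfaces.
    This edge-path length is only an approximation of true geodesic distance.
    """
    adjacency = {i: [] for i in range(len(vertices))}
    for a, b in edges:
        weight = math.dist(vertices[a], vertices[b])
        adjacency[a].append((b, weight))
        adjacency[b].append((a, weight))

    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > best.get(u, math.inf):
            continue  # stale heap entry
        for v, w in adjacency[u]:
            candidate = d + w
            if candidate < best.get(v, math.inf):
                best[v] = candidate
                heapq.heappush(heap, (candidate, v))
    return math.inf


# Hypothetical toy scene: vertices 0-1 lie on one surface, and vertices 2-3
# lie on a second surface 5 cm below it. No edge connects the two surfaces.
vertices = [
    (0.0, 0.0, 0.50),
    (1.0, 0.0, 0.50),
    (0.0, 0.0, 0.45),
    (1.0, 0.0, 0.45),
]
edges = [(0, 1), (2, 3)]

close_in_space = euclidean(vertices[0], vertices[2])      # about 0.05
far_on_surface = geodesic(vertices, edges, 0, 2)          # infinity
same_surface = geodesic(vertices, edges, 0, 1)            # 1.0
```

Vertices 0 and 2 are nearly coincident in Euclidean space, so a voxel grid would place them in the same or adjacent cells, yet no surface path joins them. Vertices 0 and 1 are farther apart in space but share a surface. This is the kind of cue the abstract attributes to the mesh domain.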