HENet++: Hybrid Encoding and Multi-task Learning for 3D Perception and End-to-end Autonomous Driving

Three-dimensional feature extraction is a critical component of autonomous driving systems, and perception tasks such as 3D object detection, bird's-eye-view (BEV) semantic segmentation, and occupancy prediction act as important constraints on the 3D features. Large image encoders, high-resolution images, and long-term temporal inputs can significantly improve feature quality and performance, but computational resource constraints often make it impractical to combine them during both training and inference. Moreover, different tasks favor different feature representations, which makes it difficult for a single model to perform end-to-end inference across multiple tasks while maintaining accuracy comparable to that of single-task models. To alleviate these issues, we present the HENet and HENet++ frameworks for multi-task 3D perception and end-to-end autonomous driving. Specifically, we propose a hybrid image encoding network that uses a large image encoder for short-term frames and a small one for long-term frames. Furthermore, our framework simultaneously extracts both dense and sparse features, which provides more suitable representations for different tasks, reduces cumulative errors, and delivers more comprehensive information to the planning module. The proposed architecture is compatible with various existing 3D feature extraction methods and supports multimodal inputs. HENet++ achieves state-of-the-art end-to-end multi-task 3D perception results on the nuScenes benchmark and attains the lowest collision rate on the nuScenes end-to-end autonomous driving benchmark.
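
The hybrid encoding idea can be illustrated with a minimal sketch that only models how frames might be routed between the two encoders. It is not the HENet++ implementation: the names `FrameAssignment`, `assign_encoders`, and `relative_cost`, the frame counts, and the per-frame cost units are all hypothetical, and the abstract does not state how many frames count as short-term.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FrameAssignment:
    frame_offset: int  # 0 is the current frame, 1 the previous one, and so on
    encoder: str       # "large" or "small"


def assign_encoders(num_frames: int, num_short_term: int) -> List[FrameAssignment]:
    """Route the most recent frames to the large encoder and the rest to the small one."""
    if num_frames < 0 or not 0 <= num_short_term <= num_frames:
        raise ValueError("need 0 <= num_short_term <= num_frames")
    return [
        FrameAssignment(offset, "large" if offset < num_short_term else "small")
        for offset in range(num_frames)
    ]


def relative_cost(plan: List[FrameAssignment], large_cost: float, small_cost: float) -> float:
    """Total encoding cost in arbitrary units; the per-frame costs are placeholders."""
    return sum(large_cost if a.encoder == "large" else small_cost for a in plan)


if __name__ == "__main__":
    # Hypothetical setting: 9 input frames, of which the 2 most recent are short-term.
    plan = assign_encoders(num_frames=9, num_short_term=2)
    hybrid = relative_cost(plan, large_cost=10.0, small_cost=1.0)
    all_large = relative_cost(assign_encoders(9, 9), large_cost=10.0, small_cost=1.0)
    print([a.encoder for a in plan])
    print(hybrid, all_large)
```

Under these placeholder costs, sending only the most recent frames through the large encoder keeps the total well below the cost of encoding every frame with it, which is the trade-off the hybrid design targets.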
