Perceiving and autonomously navigating through work zones is a challenging and underexplored problem. Open datasets for this long-tailed scenario are scarce. We propose the ROADWork dataset to learn to recognize, observe, analyze, and drive through work zones. State-of-the-art foundation models fail when applied to work zones. Fine-tuning models on our dataset significantly improves perception and navigation in work zones. With the ROADWork dataset, we discover new work zone images with higher precision (+32.5%) at a much higher rate (12.8$\times$) around the world. Open-vocabulary methods fail too, whereas fine-tuned detectors improve performance (+32.2 AP). Vision-Language Models (VLMs) struggle to describe work zones, but fine-tuning substantially improves performance (+36.7 SPICE). Beyond fine-tuning, we show the value of simple techniques. Video label propagation provides additional gains (+2.6 AP) for instance segmentation. When reading work zone signs, composing a detector and a text spotter via crop-scaling improves performance (+14.2% 1-NED). Composing work zone detections to provide context further reduces hallucinations (+3.9 SPICE) in VLMs. We predict navigational goals and compute drivable paths from work zone videos. Incorporating road work semantics ensures that 53.6% of goals have angular error (AE) < 0.5 (+9.9%) and 75.3% of pathways have AE < 0.5 (+8.1%).
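Below is a minimal sketch (not the authors' code) of the crop-scaling composition mentioned above for reading work zone signs: a detector proposes sign boxes, each crop is upscaled, and a text spotter reads the enlarged crop; results are scored with 1-NED (one minus normalized edit distance). The `detect_signs` and `spot_text` callables are hypothetical stand-ins for any detector / text-spotter pair.

```python
# Sketch of detector + text-spotter composition via crop-scaling, with 1-NED scoring.
# `detect_signs` and `spot_text` are hypothetical placeholders, not part of ROADWork.
from PIL import Image


def normalized_edit_distance(pred: str, gt: str) -> float:
    """Levenshtein distance divided by the length of the longer string."""
    if not pred and not gt:
        return 0.0
    prev = list(range(len(gt) + 1))
    for i, p in enumerate(pred, start=1):
        curr = [i]
        for j, g in enumerate(gt, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (p != g)))  # substitution
        prev = curr
    return prev[-1] / max(len(pred), len(gt))


def read_signs(image: Image.Image, detect_signs, spot_text, scale: int = 4):
    """Detect signs, crop and upscale each box, then run the text spotter."""
    readings = []
    for (x0, y0, x1, y1) in detect_signs(image):        # hypothetical detector
        crop = image.crop((x0, y0, x1, y1))
        crop = crop.resize((crop.width * scale, crop.height * scale),
                           Image.BICUBIC)               # crop-scaling step
        readings.append(spot_text(crop))                # hypothetical text spotter
    return readings


def mean_one_minus_ned(predictions, ground_truths):
    """Mean 1-NED over matched predicted / ground-truth sign texts (higher is better)."""
    scores = [1.0 - normalized_edit_distance(p, g)
              for p, g in zip(predictions, ground_truths)]
    return sum(scores) / len(scores) if scores else 0.0
```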