Ensuring the reliability of autonomous driving perception systems requires extensive environment-based testing, yet carrying out such testing in the real world is often impractical. Synthetic datasets have therefore emerged as a promising alternative, offering advantages such as cost-effectiveness, bias-free labeling, and controllable scenarios. However, the domain gap between synthetic and real-world datasets remains a major obstacle to model generalization. To address this challenge from a data-centric perspective, this paper introduces a profile extraction and discovery framework that characterizes the style profiles underlying both synthetic and real image datasets. We propose Style Embedding Distribution Discrepancy (SEDD) as a novel evaluation metric. To extract style embeddings, our framework combines Gram matrix-based style extraction with metric learning optimized for intra-class compactness and inter-class separation. Furthermore, we establish a benchmark using publicly available datasets. Experiments on a variety of datasets and sim-to-real methods show that our method can quantify the synthetic-to-real gap. This work provides a standardized, profiling-based quality control paradigm that enables systematic diagnosis and targeted enhancement of synthetic datasets, advancing the development of data-driven autonomous driving systems.
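To make the Gram matrix-based style extraction step concrete, the following minimal Python sketch computes a normalized Gram matrix from a flattened feature map and flattens its upper triangle into a style vector. It illustrates the general technique only and is not the paper's implementation: the abstract does not specify the feature extractor, the normalization, the metric learning objective, or how SEDD itself is computed, so the function names `gram_matrix` and `style_vector`, the division by the number of spatial positions, and the toy values are all assumptions made for illustration.

```python
from typing import List

Matrix = List[List[float]]


def gram_matrix(features: Matrix) -> Matrix:
    """Return the normalized Gram matrix of a flattened feature map.

    `features` has one row per channel and one column per spatial position
    (shape C x N, with N = height * width). Entry (i, j) of the result is the
    inner product of channels i and j divided by N. The 1/N normalization is
    an assumption; other conventions exist.
    """
    if not features or not features[0]:
        raise ValueError("features must be a non-empty C x N matrix")
    n = len(features[0])
    if any(len(row) != n for row in features):
        raise ValueError("all channels must have the same number of positions")
    return [
        [sum(a * b for a, b in zip(row_i, row_j)) / n for row_j in features]
        for row_i in features
    ]


def style_vector(features: Matrix) -> List[float]:
    """Flatten the upper triangle of the symmetric Gram matrix into a vector."""
    gram = gram_matrix(features)
    c = len(gram)
    return [gram[i][j] for i in range(c) for j in range(i, c)]


# Toy feature map: 2 channels, 4 spatial positions (hypothetical values).
toy_features = [
    [1.0, 2.0, 3.0, 4.0],
    [0.0, 1.0, 0.0, 1.0],
]
toy_gram = gram_matrix(toy_features)    # [[7.5, 1.5], [1.5, 0.5]]
toy_style = style_vector(toy_features)  # [7.5, 1.5, 0.5]
```

In practice the feature maps would come from a convolutional network and the computation would use an array library rather than nested lists. A vector like this would then presumably be passed to the learned embedding stage, whose details the abstract leaves open.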