We propose a novel task, hallucination localization in video captioning, which aims to identify hallucinated content in video captions at the span level (i.e., individual words or phrases). This enables a more fine-grained analysis of hallucinations than the existing sentence-level hallucination detection task. To establish a benchmark for hallucination localization, we construct HLVC-Dataset, a carefully curated dataset created by manually annotating 1,167 video-caption pairs drawn from VideoLLM-generated captions. We further implement a VideoLLM-based baseline method and conduct quantitative and qualitative evaluations to benchmark current performance on hallucination localization.