Although Video Large Language Models (Video-LLMs) have advanced rapidly in recent years, perceptual hallucinations pose a substantial safety risk that severely restricts their real-world applicability. While several hallucination-mitigation methods have been proposed, they often compromise the model's capacity for video understanding and reasoning. In this work, we propose SmartSight, a pioneering step toward addressing this issue in a training-free manner by leveraging the model's own introspective capabilities. Specifically, SmartSight generates multiple candidate responses to uncover low-hallucination outputs that are often obscured by standard greedy decoding. It assesses the hallucination level of each response with the Temporal Attention Collapse score, which measures whether the model over-focuses on trivial temporal regions of the input video while generating the response. To improve efficiency, SmartSight identifies the Visual Attention Vanishing point, enabling more accurate hallucination estimation and early termination of hallucinated responses, which substantially reduces decoding cost. Experiments show that SmartSight substantially lowers hallucination for Qwen2.5-VL-7B by 10.59% on VRIPT-HAL while simultaneously enhancing video understanding and reasoning, improving performance on VideoMMMU by up to 8.86%. These results highlight SmartSight's effectiveness in improving the reliability of open-source Video-LLMs.
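To make the candidate-selection idea concrete, the sketch below shows one way such a loop might look. It is illustrative only: the use of attention entropy over video frames as a proxy for the Temporal Attention Collapse score, the function names, and the assumed input format (a per-token attention map over frames for each sampled response) are our own assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): rank sampled candidate
# responses by how concentrated their attention over video frames is,
# using attention entropy as an assumed proxy for the Temporal
# Attention Collapse score described in the abstract.
import numpy as np


def temporal_collapse_score(frame_attention: np.ndarray) -> float:
    """Lower attention entropy => attention collapses onto few frames => higher score.

    frame_attention: (num_generated_tokens, num_frames) attention weights that
    each generated token places on the video frames (assumed to be available
    from the decoder's cross- or self-attention).
    """
    probs = frame_attention / frame_attention.sum(axis=-1, keepdims=True)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1)   # per-token entropy
    max_entropy = np.log(frame_attention.shape[-1])
    return float(1.0 - entropy.mean() / max_entropy)          # normalized to [0, 1]


def select_response(candidates):
    """candidates: list of (response_text, frame_attention) pairs (hypothetical)."""
    scores = [temporal_collapse_score(att) for _, att in candidates]
    best = int(np.argmin(scores))   # lowest collapse = assumed least hallucinated
    return candidates[best][0], scores[best]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    toy = [
        ("response A", rng.random((12, 16))),        # fairly uniform attention
        ("response B", rng.random((12, 16)) ** 4),   # peaked attention => collapses
    ]
    text, score = select_response(toy)
    print(text, round(score, 3))
```

In this toy setup the more uniformly attending candidate is preferred; the paper's early-termination step at the Visual Attention Vanishing point would additionally stop decoding a candidate once its score is deemed irrecoverably high, rather than scoring complete responses as done here.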