Video Anomaly Detection (VAD) aims to locate events that deviate from normal patterns in videos. Traditional approaches often rely on extensive labeled data and incur high computational costs. Recent tuning-free methods based on Multimodal Large Language Models (MLLMs) offer a promising alternative by leveraging their rich world knowledge. However, these methods typically rely on textual outputs, which introduces information loss, exhibits normalcy bias, and suffers from prompt sensitivity, making them insufficient for capturing subtle anomalous cues. To address these constraints, we propose HeadHunt-VAD, a novel tuning-free VAD paradigm that bypasses textual generation by directly hunting robust anomaly-sensitive internal attention heads within the frozen MLLM. Central to our method is a Robust Head Identification module that systematically evaluates all attention heads using a multi-criteria analysis of saliency and stability, identifying a sparse subset of heads that are consistently discriminative across diverse prompts. Features from these expert heads are then fed into a lightweight anomaly scorer and a temporal locator, enabling efficient and accurate anomaly detection with interpretable outputs. Extensive experiments show that HeadHunt-VAD achieves state-of-the-art performance among tuning-free methods on two major VAD benchmarks while maintaining high efficiency, validating head-level probing in MLLMs as a powerful and practical solution for real-world anomaly detection.
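As a loose illustration of the multi-criteria head selection described above (saliency combined with stability across prompts), the following is a minimal NumPy sketch. The synthetic data, the exact criteria definitions, and the way the two criteria are combined are all hypothetical stand-ins, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: feats[p, h, i] is a scalar piece of anomaly evidence
# extracted from attention head h on clip i under prompt variant p.
H, P, N = 32, 5, 200                      # heads, prompt variants, clips
feats = rng.normal(size=(P, H, N))
labels = rng.integers(0, 2, size=N).astype(bool)   # True = anomalous clip

# Inject signal into a few heads so selection has something to find.
feats[:, :4, labels] += 2.0

# Saliency: per-prompt separation between anomalous and normal clips
# (absolute difference of means, normalized by the per-head std).
mu_a = feats[:, :, labels].mean(axis=2)            # (P, H)
mu_n = feats[:, :, ~labels].mean(axis=2)           # (P, H)
std = feats.std(axis=2) + 1e-8
saliency = np.abs(mu_a - mu_n) / std               # (P, H)

# Stability: down-weight heads whose saliency fluctuates across prompts.
mean_sal = saliency.mean(axis=0)                   # (H,)
stability = 1.0 / (1.0 + saliency.std(axis=0))     # (H,)

# Multi-criteria score; keep a sparse top-k subset of "expert" heads
# whose features would feed the lightweight anomaly scorer.
score = mean_sal * stability
k = 4
expert_heads = np.argsort(score)[-k:]
```

In this toy setting the four heads with injected signal are the ones selected; in the actual method the analogous scores would be computed from the frozen MLLM's attention-head activations over diverse prompts.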