Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse applications, from open-domain question answering to scientific writing, medical decision support, and legal analysis. However, their tendency to generate incorrect or fabricated content, commonly known as hallucinations, remains a critical barrier to reliable deployment in high-stakes domains. Current hallucination detection benchmarks are limited in scope, focusing primarily on general-knowledge domains while neglecting specialised fields where accuracy is paramount. To address this gap, we introduce the AIME Math Hallucination dataset, the first comprehensive benchmark specifically designed for evaluating hallucinations in mathematical reasoning. In addition, we propose SelfCheck-Eval, an LLM-agnostic, black-box hallucination detection framework applicable to both open- and closed-source LLMs. Our approach leverages a novel multi-module architecture that integrates three independent detection strategies: a Semantic module, a Specialised Detection module, and a Contextual Consistency module. Our evaluation reveals systematic performance disparities across domains: existing methods perform well on biographical content but struggle significantly with mathematical reasoning, a challenge that persists across NLI fine-tuning, preference learning, and process supervision approaches. These findings highlight the fundamental limitations of current detection methods in mathematical domains and underscore the critical need for specialised, black-box-compatible approaches to ensure reliable LLM deployment.