Retrieval-Augmented Generation (RAG) systems remain susceptible to hallucinations despite grounding in retrieved evidence. While current detection methods leverage embedding similarity and natural language inference (NLI), their reliability in safety-critical settings remains unproven. We apply conformal prediction to RAG hallucination detection, transforming heuristic scores into decision sets with finite-sample coverage guarantees (1 − α). Using calibration sets of n = 600, we demonstrate a fundamental dichotomy: on synthetic hallucinations (Natural Questions), embedding methods achieve 95% coverage with a 0% false positive rate (FPR). However, on real hallucinations from RLHF-aligned models (HaluEval), the same methods fail catastrophically, yielding 100% FPR at the target coverage. We analyze this failure through the lens of distributional tails, showing that while NLI models achieve acceptable AUC (0.81), the "hardest" hallucinations are semantically indistinguishable from faithful responses, forcing conformal thresholds to reject nearly all valid outputs. Crucially, GPT-4 as a judge achieves 7% FPR (95% CI: [3.4%, 13.7%]) on the same data, demonstrating that the task is solvable via reasoning yet opaque to surface-level semantics, a phenomenon we term the "Semantic Illusion."
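Below is a minimal sketch of the split-conformal calibration idea described above, assuming a scalar nonconformity score per response (for example, one minus an NLI entailment probability between retrieved evidence and the generated answer). The function names, the simulated Beta-distributed scores, and the score choice are illustrative assumptions for exposition, not the paper's implementation.

```python
import numpy as np

def conformal_threshold(cal_scores: np.ndarray, alpha: float) -> float:
    """Split-conformal threshold: the ceil((n+1)(1-alpha))/n empirical quantile
    of nonconformity scores computed on faithful calibration responses."""
    n = len(cal_scores)
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return float(np.quantile(cal_scores, q_level, method="higher"))

def flag_hallucination(scores: np.ndarray, threshold: float) -> np.ndarray:
    """Flag a response as hallucinated when its nonconformity score exceeds the
    calibrated threshold; faithful responses are accepted with >= 1-alpha coverage."""
    return scores > threshold

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical scores: faithful responses score low, hallucinations higher.
    cal = rng.beta(2, 8, size=600)            # 600 faithful calibration responses
    test_faithful = rng.beta(2, 8, size=200)  # held-out faithful responses
    test_halluc = rng.beta(6, 3, size=200)    # held-out hallucinated responses

    tau = conformal_threshold(cal, alpha=0.05)  # target 95% coverage
    print("threshold:", round(tau, 3))
    print("coverage on faithful:", 1 - flag_hallucination(test_faithful, tau).mean())
    print("detection rate on hallucinations:", flag_hallucination(test_halluc, tau).mean())
```

The guarantee applies only to coverage on faithful responses; as the abstract notes, when the hardest hallucinations overlap the faithful-score distribution, the calibrated threshold must either miss them or reject nearly all valid outputs.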