Can LLMs Detect Their Confabulations? Estimating Reliability in Uncertainty-Aware Language Models

Large Language Models (LLMs) are prone to generating fluent but incorrect content, known as confabulation, which poses increasing risks in multi-turn or agentic applications where outputs may be reused as context. In this work, we investigate how in-context information influences model behavior and whether LLMs can identify their own unreliable responses. We propose a reliability estimation method that uses token-level uncertainty to guide the aggregation of internal model representations. Specifically, we compute aleatoric and epistemic uncertainty from output logits to identify salient tokens, then aggregate the hidden states of those tokens into compact representations for response-level reliability prediction. Through controlled experiments on open QA benchmarks, we find that correct in-context information improves both answer accuracy and model confidence, while misleading context often induces confidently incorrect responses, revealing a misalignment between uncertainty and correctness. Our probing-based method captures these shifts in model behavior and improves the detection of unreliable outputs across multiple open-source LLMs. These results underscore the limitations of direct uncertainty signals and highlight the potential of uncertainty-guided probing for reliability-aware generation.
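The abstract does not spell out the formulas, so the following is only a minimal sketch of what uncertainty-guided aggregation could look like, under stated assumptions. It assumes a Dirichlet (evidential) reading of the logits, in which the top-k non-negative logits of a token act as evidence, epistemic uncertainty is the vacuity K / sum(alpha), and aleatoric uncertainty is the expected entropy under that Dirichlet. The choice to rank tokens by epistemic uncertainty, the mean pooling, and every name here (`token_uncertainty`, `pool_salient_states`, `k`, `m`) are hypothetical illustrations, not the authors' implementation.

```python
import math


def digamma(x):
    """Digamma function for x > 0 (recurrence up to x >= 6, then asymptotic series)."""
    acc = 0.0
    while x < 6.0:
        acc -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return acc + math.log(x) - 0.5 / x - f * (1.0 / 12 - f * (1.0 / 120 - f / 252))


def token_uncertainty(logits, k=3):
    """Return (aleatoric, epistemic) for one token's logits.

    Assumption: the top-k logits, clipped at zero, are Dirichlet evidence.
    """
    top = sorted(logits, reverse=True)[:k]
    alpha = [max(z, 0.0) + 1.0 for z in top]  # evidence + 1
    alpha0 = sum(alpha)
    # Vacuity: close to 1 when there is little evidence for any candidate.
    epistemic = len(alpha) / alpha0
    # Expected entropy of the categorical distribution under Dirichlet(alpha).
    aleatoric = -sum(
        (a / alpha0) * (digamma(a + 1.0) - digamma(alpha0 + 1.0)) for a in alpha
    )
    return aleatoric, epistemic


def pool_salient_states(token_logits, hidden_states, m=2, k=3):
    """Mean-pool the hidden states of the m tokens with highest epistemic uncertainty.

    Assumption: "salient" means highest epistemic score; other rules are possible.
    """
    scores = [token_uncertainty(z, k)[1] for z in token_logits]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    picked = sorted(ranked[:m])  # restore original token order
    dim = len(hidden_states[0])
    pooled = [
        sum(hidden_states[i][d] for i in picked) / len(picked) for d in range(dim)
    ]
    return picked, pooled


# Toy response of four tokens with 2-dimensional hidden states.
toy_logits = [
    [10.0, 1.0, 0.5, 0.1],
    [0.2, 0.1, 0.1, 0.0],
    [8.0, 7.5, 0.3, 0.1],
    [0.3, 0.2, 0.1, 0.05],
]
toy_hidden = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0], [4.0, 0.0]]
picked, pooled = pool_salient_states(toy_logits, toy_hidden, m=2, k=3)
print(picked, pooled)
```

In this sketch the pooled vector would then be passed to a small probe (for example, a logistic regression trained on correct versus incorrect responses) to produce the response-level reliability score. The probe is omitted here because the abstract gives no details about it.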
