Hallucinations in large language models (LLMs) are commonly regarded as errors to be minimized. However, recent perspectives suggest that some hallucinations may encode creative or epistemically valuable content, a dimension that remains underquantified in the current literature. Existing hallucination detection methods focus primarily on factual consistency; they struggle to handle heterogeneous scientific tasks and to balance creativity with accuracy. To address these challenges, we propose HIC-Bench, a novel evaluation framework that categorizes hallucinations into Intelligent Hallucinations (IH) and Defective Hallucinations (DH), enabling systematic investigation of their interplay in LLM creativity. HIC-Bench has three core characteristics: (1) Structured IH/DH Assessment, using a multi-dimensional metric matrix that integrates Torrance Tests of Creative Thinking (TTCT) metrics (Originality, Feasibility, Value) with hallucination-specific dimensions (scientific plausibility, factual deviation); (2) Cross-Domain Applicability, spanning ten scientific domains with open-ended innovation tasks; and (3) Dynamic Prompt Optimization, leveraging the Dynamic Hallucination Prompt (DHP) to guide models toward creative and reliable outputs. The evaluation process employs multiple LLM judges, whose scores are averaged to mitigate bias, with human annotators verifying the IH/DH classifications. Experimental results reveal a nonlinear relationship between IH and DH, demonstrating that creativity and correctness can be jointly optimized. These insights position IH as a catalyst for creativity and reveal the potential of LLM hallucinations to drive scientific innovation. Additionally, HIC-Bench offers a valuable platform for advancing research into the creative intelligence of LLM hallucinations.
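The judge-averaging step mentioned in the abstract can be illustrated with a minimal sketch. It is not the HIC-Bench implementation: the function name `average_judge_scores`, the judge names, the 1-to-5 score scale, and the example values are all hypothetical, and only the five dimension names come from the abstract.

```python
from statistics import mean

# Dimension names follow the abstract; everything else here is an assumption.
DIMENSIONS = (
    "originality",
    "feasibility",
    "value",
    "scientific_plausibility",
    "factual_deviation",
)


def average_judge_scores(judge_scores):
    """Average each dimension over all judges.

    judge_scores maps a (hypothetical) judge name to a dict of
    dimension -> score. Returns a dict of dimension -> mean score.
    """
    if not judge_scores:
        raise ValueError("at least one judge is required")
    return {
        dim: mean(scores[dim] for scores in judge_scores.values())
        for dim in DIMENSIONS
    }


# Made-up scores on an assumed 1-5 scale, for illustration only.
example = {
    "judge_a": {"originality": 4.0, "feasibility": 3.0, "value": 4.0,
                "scientific_plausibility": 3.0, "factual_deviation": 2.0},
    "judge_b": {"originality": 5.0, "feasibility": 3.0, "value": 3.0,
                "scientific_plausibility": 4.0, "factual_deviation": 2.0},
    "judge_c": {"originality": 3.0, "feasibility": 3.0, "value": 5.0,
                "scientific_plausibility": 2.0, "factual_deviation": 2.0},
}

averaged = average_judge_scores(example)
for dim in DIMENSIONS:
    print(f"{dim}: {averaged[dim]:.2f}")
```

Averaging per dimension, instead of collapsing everything into a single number, keeps the creativity-oriented and hallucination-specific scores separate, which is what a later IH/DH decision would need. How the framework maps these averages to an IH or DH label is not stated in the abstract, so the sketch stops at the averaging step.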