Large Language Models (LLMs) increasingly serve as research assistants, yet their reliability on scholarly tasks remains under-evaluated. In this work, we introduce PaperAsk, a benchmark that systematically evaluates LLMs across four key research tasks: citation retrieval, content extraction, paper discovery, and claim verification. We evaluate GPT-4o, GPT-5, and Gemini-2.5-Flash under realistic usage conditions, namely via web interfaces where search operations are opaque to the user. Through controlled experiments, we find consistent reliability failures: citation retrieval fails in 48-98% of multi-reference queries, section-specific content extraction fails in 72-91% of cases, and topical paper discovery yields F1 scores below 0.32, missing over 60% of the relevant literature. Human analysis further attributes these failures to the uncontrolled expansion of retrieved context and to the tendency of LLMs to prioritize semantically relevant text over task instructions. Across basic tasks, the LLMs display distinct failure behaviors: ChatGPT often withholds responses rather than risk errors, whereas Gemini produces fluent but fabricated answers. To address these issues, we develop lightweight reliability classifiers trained on PaperAsk data to identify unreliable outputs. PaperAsk provides a reproducible and diagnostic framework for advancing the reliability evaluation of LLM-based scholarly assistance systems.