Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs

Evaluating large language models (LLMs) on tasks that require long-term memory, and therefore long-context reasoning (for example, in conversational settings), is hampered by existing benchmarks, which often lack narrative coherence, cover narrow domains, and test only simple recall-oriented tasks. This paper introduces a comprehensive solution to these challenges. First, we present a novel framework for automatically generating long (up to 10M tokens), coherent, and topically diverse conversations, accompanied by probing questions that target a wide range of memory abilities. From this, we construct BEAM, a new benchmark comprising 100 conversations and 2,000 validated questions. Second, to enhance model performance, we propose LIGHT, a framework inspired by human cognition that equips LLMs with three complementary memory systems: a long-term episodic memory, a short-term working memory, and a scratchpad for accumulating salient facts. Our experiments on BEAM reveal that even LLMs with 1M-token context windows, with or without retrieval augmentation, struggle as dialogues lengthen. In contrast, LIGHT consistently improves performance across various models, achieving an average improvement of 3.5% to 12.69% over the strongest baselines, depending on the backbone LLM. An ablation study further confirms the contribution of each memory component.
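The abstract names three complementary memories but does not say how they are implemented. As a rough illustration only, the Python sketch below shows one way such a trio could be wired together. Everything in it is an assumption made for this example rather than the LIGHT implementation: the class name `ThreeMemoryStore`, the word-overlap retrieval (a stand-in for whatever retriever the authors use), the fixed-size recent-turn window, and the first-person salience rule for the scratchpad.

```python
import re
from collections import deque
from typing import List, Set


def tokens(text: str) -> Set[str]:
    """Lowercased word set; a crude stand-in for a learned embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class ThreeMemoryStore:
    """Toy illustration of three complementary memories (not the LIGHT code)."""

    def __init__(self, working_size: int = 4) -> None:
        self.episodic: List[str] = []               # long-term: every turn
        self.working = deque(maxlen=working_size)   # short-term: recent turns
        self.scratchpad: List[str] = []             # accumulated salient facts

    def add_turn(self, turn: str) -> None:
        self.episodic.append(turn)
        self.working.append(turn)
        # Assumed salience rule: keep turns with a first-person "I" or "my".
        if re.search(r"\b(i|my)\b", turn.lower()):
            self.scratchpad.append(turn)

    def retrieve(self, query: str, k: int = 2) -> List[str]:
        """Return up to k past turns ranked by word overlap with the query."""
        q = tokens(query)
        scored = [(len(q & tokens(t)), t) for t in self.episodic]
        # sorted() is stable, so ties keep their chronological order.
        ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
        return [t for _, t in ranked[:k]]

    def build_context(self, query: str, k: int = 2) -> str:
        """Assemble the text a backbone LLM would be prompted with."""
        parts = [
            "Facts: " + " | ".join(self.scratchpad),
            "Retrieved: " + " | ".join(self.retrieve(query, k)),
            "Recent: " + " | ".join(self.working),
            "Question: " + query,
        ]
        return "\n".join(parts)
```

In this sketch the prompt stays bounded as the dialogue grows: only the scratchpad, the top-k retrieved turns, and the recent window are passed on, rather than the full history. A real system would presumably summarize or filter the scratchpad as well, which this toy version does not attempt.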

