We present a universal theoretical framework for understanding long-context language modeling, based on a bipartite mutual information scaling law that we rigorously verify in natural language. We demonstrate that bipartite mutual information captures multi-token interactions that are distinct from, and scale independently of, conventional two-point mutual information, and show that it provides a more complete characterization of the dependencies needed for accurately modeling long sequences. Leveraging this scaling law, we formulate the Long-context Language Modeling (L$^2$M) condition, which lower bounds how a model's history state -- the latent variables responsible for storing past information -- must scale for effective long-context modeling. We validate the framework and its predictions on transformer and state-space models. Our work provides a principled foundation for understanding long-context modeling and for designing more efficient architectures with stronger long-context capabilities, with potential applications beyond natural language.
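As a minimal sketch of the quantities named above: the bipartite mutual information between two adjacent token blocks is the standard mutual information applied to those blocks, the scaling law asserts growth with the block length (written here in an illustrative power-law form with exponent $\beta$), and the L$^2$M condition is read schematically as a lower bound on the size of the history state. The symbols $L$, $\beta$, and $z_L$ are notation introduced here for illustration, not taken verbatim from the paper, and the last line paraphrases rather than reproduces the formal statement.
\begin{align*}
  % Bipartite mutual information between a length-L prefix and the following block
  % (standard definition of mutual information applied to two token blocks):
  I\bigl(X_{1:L};\, X_{L+1:2L}\bigr)
    &= H\bigl(X_{1:L}\bigr) + H\bigl(X_{L+1:2L}\bigr) - H\bigl(X_{1:2L}\bigr), \\
  % Scaling law, illustrative power-law form in the block length L:
  I\bigl(X_{1:L};\, X_{L+1:2L}\bigr)
    &\propto L^{\beta}, \qquad \beta > 0, \\
  % L^2M condition, read schematically: the history state z_L that stores the past
  % must grow at least as fast as the information it needs to carry forward:
  \lvert z_L \rvert
    &\gtrsim I\bigl(X_{1:L};\, X_{L+1:2L}\bigr).
\end{align*}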