Large Language Models (LLMs) are increasingly applied to long-context modeling, yet the computational cost of their inference has become a critical bottleneck for tasks such as agentic and multimodal applications. This report presents a preliminary investigation into the effectiveness and theoretical mechanisms of the Top-$k$ Attention mechanism during both the decoding and training phases. First, we validate the effectiveness of exact Top-$k$ Decoding through extensive experiments: retaining only the pivotal Keys most similar to the Query as the context window during decoding achieves performance comparable to, or even surpassing, full attention on downstream benchmarks such as HELMET and LongBench v2. Second, we further explore a native Top-$k$ Attention training strategy. Experiments confirm that keeping the Top-$k$ Attention operation consistent between training and inference further unlocks the potential of Top-$k$ Decoding and significantly improves model performance. Furthermore, given the high computational complexity of exact Top-$k$ Attention, we investigate how the precision of approximate Top-$k$ algorithms affects downstream tasks. Our results confirm a positive correlation between downstream performance and approximation fidelity, and we provide a statistical evaluation of the Lightning Indexer's precision within the DeepSeek-V3.2-Exp model. Finally, this report offers a theoretical interpretation from the perspective of entropy: models subjected to Top-$k$ Attention SFT exhibit a distinct entropy-reduction phenomenon on downstream tasks, which supports the hypothesis that low-entropy states are better suited to Top-$k$ Decoding.
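For concreteness, exact Top-$k$ Decoding at step $t$ can be sketched as follows (a notational illustration only; the symbols $q_t$, $k_i$, $v_i$, $\mathcal{S}_t$, and $d$ are introduced here for exposition and are not taken from the report): the softmax is restricted to the $k$ cached keys with the highest query-key similarity, and all other positions receive zero attention weight,

$$
\mathcal{S}_t = \operatorname{TopK}_{i \le t}\!\left(\frac{q_t^{\top} k_i}{\sqrt{d}},\, k\right), \qquad
\alpha_{t,i} = \frac{\exp\!\big(q_t^{\top} k_i / \sqrt{d}\big)}{\sum_{j \in \mathcal{S}_t} \exp\!\big(q_t^{\top} k_j / \sqrt{d}\big)} \ \text{ for } i \in \mathcal{S}_t, \qquad
o_t = \sum_{i \in \mathcal{S}_t} \alpha_{t,i}\, v_i .
$$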