Large language models (LLMs) have demonstrated strong performance across a variety of natural language processing (NLP) tasks. However, they often struggle with long input sequences due to the ``lost in the middle'' phenomenon. This issue has been shown to arise from a U-shaped attention bias, in which attention is disproportionately focused on the beginning and end of a text, leaving the middle section underrepresented. While previous studies have attributed this bias to position encoding, our work is the first to identify an additional contributing factor: initial saliency. That is, in the attention computation for each token, tokens that carry higher attention weights relative to the initial token tend to receive more attention when predicting the next token. We further find that exploiting this property by scaling the attention weights between the initial token and the other tokens improves the model's ability to process long contexts, yielding an improvement of up to 3.6\% on the MDQA dataset. Moreover, combining this approach with existing methods that reduce position-encoding bias further enhances performance, yielding an improvement of up to 3.4\% on KV-Retrieval tasks.
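To make the idea concrete, the following is a minimal PyTorch sketch of one way the scaling described above could be realized: rescaling the attention logits that every query assigns to the initial token before the softmax. The function name, the \texttt{init\_scale} parameter, and the choice of applying the factor to pre-softmax logits are illustrative assumptions, not the paper's exact formulation.

\begin{verbatim}
# Hypothetical sketch: scale the attention given to the initial token.
import torch
import torch.nn.functional as F


def attention_with_initial_scaling(q, k, v, init_scale=1.2):
    """Causal scaled-dot-product attention with the initial-token
    column of the logits rescaled.

    q, k, v: tensors of shape (batch, heads, seq_len, head_dim)
    init_scale: factor applied to the logits each query assigns
        to the initial token (column 0). Assumed name/value.
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # (B, H, T, T)

    # Rescale the logits pointing at the initial token.
    scores[..., :, 0] = scores[..., :, 0] * init_scale

    # Standard causal mask: attend only to current and past tokens.
    T = q.size(-2)
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool,
                                   device=q.device))
    scores = scores.masked_fill(~causal, float("-inf"))

    weights = F.softmax(scores, dim=-1)
    return weights @ v


# Toy usage with random tensors.
q = k = v = torch.randn(1, 4, 16, 64)
out = attention_with_initial_scaling(q, k, v)
print(out.shape)  # torch.Size([1, 4, 16, 64])
\end{verbatim}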