Current Parameter-Efficient Fine-Tuning (PEFT) methods typically operate under an implicit assumption: once a target module is selected, every token passing through it contributes equally to the downstream task and requires a parameter update. In this paper, we challenge this convention and reveal a pervasive token-level redundancy in the fine-tuning of large models. We propose TS-PEFT, a theoretically grounded framework that uses proximal optimization to dynamically identify and skip redundant token updates during training. Extensive experiments across Natural Language Understanding, Commonsense Reasoning, and Visual Instruction Tuning demonstrate that indiscriminately updating all tokens is not only computationally superfluous but often introduces optimization noise. Strikingly, despite discarding 40%-60% of token updates, TS-PEFT consistently matches or surpasses the performance of dense baselines (e.g., LoRA, DoRA). Furthermore, we provide an in-depth analysis revealing that the learned token-level sparsity serves as a superior indicator of module importance compared to traditional weight norms, offering a novel data-driven perspective on the intrinsic adaptation mechanism of large models.
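The abstract does not specify the exact mechanism, but one common way to realize proximal-optimization-based token skipping is to attach a per-token gate to the low-rank (LoRA-style) update and sparsify the gates with the L1 proximal operator (soft thresholding), which drives redundant tokens' gates to exactly zero. The sketch below is an illustrative assumption, not the paper's actual implementation; the names `soft_threshold`, `tspeft_forward`, and the gating scheme are hypothetical.

```python
import numpy as np

def soft_threshold(g, lam):
    # Proximal operator of lam * ||g||_1: shrinks gates toward zero
    # and sets small ones exactly to zero (hard skip).
    return np.sign(g) * np.maximum(np.abs(g) - lam, 0.0)

def tspeft_forward(x, W, A, B, gates):
    # x: (tokens, d_in); W: frozen weight; A, B: LoRA-style low-rank adapters.
    # A zero gate means that token receives no parameter update (skipped).
    base = x @ W
    delta = (x @ A) @ B                   # per-token low-rank update
    return base + gates[:, None] * delta  # gate scales each token's update

# Toy demonstration with deterministic gates.
rng = np.random.default_rng(0)
T, d_in, d_out, r = 8, 4, 4, 2
x = rng.standard_normal((T, d_in))
W = rng.standard_normal((d_in, d_out))
A = rng.standard_normal((d_in, r)) * 0.1
B = rng.standard_normal((r, d_out)) * 0.1

gates = np.linspace(-1.0, 1.0, T)        # pre-proximal gate values
gates = soft_threshold(gates, lam=0.5)   # proximal step sparsifies gates
skipped = int(np.sum(gates == 0.0))      # tokens whose updates are skipped
y = tspeft_forward(x, W, A, B, gates)
```

Tokens whose gate is zeroed pass through the frozen weight unchanged, which is how discarding a large fraction of token updates can leave the forward computation for those tokens identical to the base model.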