As access to high-quality, domain-specific data grows increasingly scarce, multi-epoch training has become a practical strategy for adapting large language models (LLMs). However, autoregressive models often suffer performance degradation under repeated data exposure, as overfitting leads to a marked decline in model capability. Through empirical analysis, we trace this degradation to an imbalance in learning dynamics: predictable, low-entropy tokens are learned quickly and come to dominate optimization, while the model's ability to generalize on high-entropy tokens deteriorates with continued training. To address this, we introduce EntroDrop, an entropy-guided token dropout method that functions as structured data regularization. EntroDrop selectively masks low-entropy tokens during training and employs a curriculum schedule to adjust regularization strength in alignment with training progress. Experiments across model scales from 0.6B to 8B parameters show that EntroDrop consistently outperforms standard regularization baselines and maintains robust performance throughout extended multi-epoch training. These findings underscore the importance of aligning regularization with token-level learning dynamics when training on limited data. Our approach offers a promising pathway toward more effective adaptation of LLMs in data-constrained domains.
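The core mechanism described above can be sketched in a few lines. This is a minimal, hypothetical illustration, not the paper's implementation: the function names (`token_entropy`, `entropy_dropout_mask`, `curriculum_threshold`), the linear schedule, and the thresholding rule are all assumptions introduced here for clarity. The idea is simply that tokens whose predictive distribution has low entropy (i.e., tokens the model already predicts confidently) are excluded from the loss, with the entropy threshold ramped up over epochs.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one token's predicted distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_dropout_mask(prob_dists, threshold):
    """Return True for tokens to KEEP in the loss (entropy >= threshold);
    low-entropy, already-learned tokens are dropped."""
    return [token_entropy(p) >= threshold for p in prob_dists]

def curriculum_threshold(epoch, total_epochs, max_threshold):
    """Hypothetical linear curriculum: regularization strength
    grows from 0 to max_threshold as training progresses."""
    return max_threshold * epoch / max(total_epochs - 1, 1)

# Example: a uniform (high-entropy) token is kept; a confident
# (low-entropy) token is dropped from the loss at threshold 0.5.
dists = [[0.25, 0.25, 0.25, 0.25], [0.97, 0.01, 0.01, 0.01]]
mask = entropy_dropout_mask(dists, curriculum_threshold(3, 4, 0.5))
```

In a real training loop the kept/dropped mask would be applied to the per-token cross-entropy terms before reduction, so dropped tokens contribute no gradient.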