Despite the impressive capabilities of large language models across various tasks, their continued scaling is severely hampered not only by data scarcity but also by the performance degradation associated with excessive data repetition during training. To overcome this critical bottleneck, we propose the Massive Genre-Audience(MGA) reformulation method, a lightweight and scalable data augmentation technique inspired by synthetic data methodologies. MGA systematically reformulates existing corpora into diverse, contextually-rich variations to mitigate the negative effects of repetition, and we introduce this approach along with the resulting 770 billion token MGACorpus in this work. We experimentally validate its core benefit by demonstrating superior performance against data repetition and upsampling in scaling scenarios (up to 13B parameters). Furthermore, comprehensive analysis investigates the role of prompt engineering in generation quality and reveals nuances in evaluating model capabilities using standard loss metrics. Our work shows that MGA provides a reliable pathway to substantially augment training datasets, effectively alleviating repetition bottlenecks and enabling more efficient scaling of large language models.
翻译:暂无翻译