Data quality is a critical driver of large language model performance, yet existing model-based selection methods focus almost exclusively on English. We introduce MuRating, a scalable framework that transfers high-quality English data-quality signals into a single rater for 17 target languages. MuRating aggregates multiple English "raters" via pairwise comparisons to learn unified document-quality scores, then projects these judgments through translation to train a multilingual evaluator on monolingual, cross-lingual, and parallel text pairs. Applied to web data, MuRating selects balanced subsets of English and multilingual content to pretrain a 1.2B-parameter LLaMA model. Compared to strong baselines, including QuRater, AskLLM, and DCLM, our approach boosts average accuracy on both English benchmarks and multilingual evaluations, with especially large gains on knowledge-intensive tasks. We further analyze translation fidelity, selection biases, and the underrepresentation of narrative material, outlining directions for future work.
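To make the aggregation step concrete, the sketch below shows one common way to turn pairwise preference judgments into a single scalar quality score per document, using a Bradley-Terry-style model fit by gradient ascent. This is an illustrative assumption, not the paper's actual implementation; the `fit_bradley_terry` function, the learning-rate and epoch settings, and the toy `judgments` data are all hypothetical.

```python
# Illustrative sketch (assumed, not the paper's implementation): learning
# unified document-quality scores from pairwise preference judgments with a
# Bradley-Terry-style model. Documents and judgments below are invented.
import math
import random


def fit_bradley_terry(pairs, num_docs, lr=0.1, epochs=200):
    """pairs: list of (winner_idx, loser_idx) preference judgments."""
    scores = [0.0] * num_docs
    for _ in range(epochs):
        random.shuffle(pairs)
        for winner, loser in pairs:
            # Probability the winner is preferred under the current scores.
            p = 1.0 / (1.0 + math.exp(scores[loser] - scores[winner]))
            # Gradient ascent on the log-likelihood of the observed preference.
            grad = 1.0 - p
            scores[winner] += lr * grad
            scores[loser] -= lr * grad
    return scores


# Hypothetical judgments: document 0 is preferred over 1 and 2, and so on.
judgments = [(0, 1), (0, 2), (1, 2), (0, 3), (2, 3)]
quality = fit_bradley_terry(judgments, num_docs=4)
print([round(s, 2) for s in quality])  # higher score = higher estimated quality
```

The resulting scores could then serve as the unified quality targets that are projected through translation to the other languages.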