This paper introduces a theoretical framework to resolve a central paradox in modern machine learning: When is it better to use less data? This question has become critical as classical scaling laws suggesting ``more is more'' (Sun et al., 2025) are challenged by methods such as LIMO (``less is more'') and s1 (Ye et al., 2025; Muennighoff et al., 2025), which achieve superior performance with small, aggressively curated datasets. Here, we study data curation strategies in which an imperfect oracle selects training examples according to their difficulty and correctness. Our results provide exact scaling-law curves for the test error under both label-agnostic and label-aware curation rules, revealing when and why keeping only a subset of the data can improve generalization. In contrast to classical scaling laws, we show that small curated datasets can outperform the full dataset, and we characterize exactly when this occurs by deriving precise phase-transition curves tied to data size and quality. We validate these theoretical claims empirically on ImageNet, confirming our predictions about when curation improves accuracy and showing that it can even mitigate model collapse. Furthermore, our framework provides a principled explanation for the seemingly contradictory curation strategies recently observed in LLM mathematical reasoning.
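To make the curation setup concrete, below is a minimal, purely illustrative Python sketch, not the paper's model or notation: it contrasts training on the full data with a label-agnostic rule (keep the hardest examples) and a label-aware rule (keep hard examples that an imperfect oracle deems correctly labeled). All quantities here (the label-noise rate, the oracle accuracy `q`, the `keep_fraction`, and the linear least-squares probe used for evaluation) are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative, not the paper's model): n examples with clean labels y,
# a fraction of labels flipped, a noisy difficulty signal, and an imperfect oracle
# that guesses whether each label is correct with accuracy q < 1.
n, d = 10_000, 50
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)
y = np.sign(X @ w_star)

# Noisy labels: 20% of labels are flipped; curation may filter these out.
flip = rng.random(n) < 0.2
y_noisy = np.where(flip, -y, y)

# Difficulty proxy: distance to the (unknown) decision boundary, observed with noise.
difficulty = -np.abs(X @ w_star) + rng.normal(scale=0.5, size=n)

# Imperfect oracle: with probability q it reports the true correctness of a label.
q = 0.85
oracle_says_correct = np.where(rng.random(n) < q, ~flip, flip)

def label_agnostic_keep(difficulty, keep_fraction=0.5):
    """Keep the hardest `keep_fraction` of examples, ignoring labels."""
    threshold = np.quantile(difficulty, 1 - keep_fraction)
    return difficulty >= threshold

def label_aware_keep(difficulty, oracle_says_correct, keep_fraction=0.5):
    """Keep hard examples, but only those the oracle believes are correctly labeled."""
    return label_agnostic_keep(difficulty, keep_fraction) & oracle_says_correct

X_test = rng.normal(size=(2_000, d))
y_test = np.sign(X_test @ w_star)

for name, keep in [
    ("full data", np.ones(n, dtype=bool)),
    ("label-agnostic", label_agnostic_keep(difficulty)),
    ("label-aware", label_aware_keep(difficulty, oracle_says_correct)),
]:
    # Fit a least-squares linear probe on the curated subset and report its error
    # on clean held-out data (a stand-in for the test error studied in the theory).
    w_hat, *_ = np.linalg.lstsq(X[keep], y_noisy[keep], rcond=None)
    err = np.mean(np.sign(X_test @ w_hat) != y_test)
    print(f"{name:15s} kept {keep.sum():5d} examples, test error {err:.3f}")
```

The sketch only illustrates the qualitative distinction between the two curation rules discussed in the abstract; the paper's exact scaling-law and phase-transition results concern the underlying theoretical model, not this toy experiment.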