Deep learning models achieve state-of-the-art performance across domains but face scalability challenges in real-time or resource-constrained scenarios. To address this, we propose Correlation of Loss Differences (CLD), a simple and scalable metric for coreset selection that identifies the most impactful training samples by measuring how closely their loss trajectories align with that of a held-out validation set. CLD is highly efficient: it requires only per-sample loss values computed at training checkpoints and avoids the costly gradient and curvature computations used in many existing subset selection methods. We develop a general theoretical framework that establishes convergence guarantees for CLD-based coresets, showing that the convergence error is upper-bounded by the alignment of the selected samples and the representativeness of the validation set. On CIFAR-100 and ImageNet-1k, CLD-based coresets typically outperform or closely match state-of-the-art methods across subset sizes, and remain within 1% of more computationally expensive baselines even when not leading. CLD transfers effectively across architectures (ResNet, VGG, DenseNet), enabling proxy-to-target selection with less than 1% accuracy degradation. Moreover, CLD is stable when using only early checkpoints, incurring negligible accuracy loss. Finally, CLD exhibits inherent bias reduction via per-class validation alignment, obviating the need for additional stratified sampling. Together, these properties make CLD a principled, efficient, stable, and transferable tool for scalable dataset optimization.
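To make the metric concrete, the sketch below shows one plausible instantiation of CLD scoring in NumPy: per-sample training losses and the mean validation loss are recorded at a series of checkpoints, loss differences between consecutive checkpoints are formed, and each training sample is scored by the Pearson correlation of its loss-difference trajectory with that of the validation set, with the highest-scoring samples kept as the coreset. The function names, the use of Pearson correlation, and the top-fraction selection rule are illustrative assumptions; the exact formulation follows the paper.

```python
import numpy as np

def cld_scores(train_losses, val_losses):
    """Score training samples by the correlation of their loss differences
    with the validation loss differences across checkpoints.

    train_losses: array of shape (num_checkpoints, num_train),
                  per-sample training loss at each checkpoint.
    val_losses:   array of shape (num_checkpoints,),
                  mean validation loss at each checkpoint.

    NOTE: illustrative sketch; the precise CLD definition is in the paper.
    """
    # Loss differences between consecutive checkpoints.
    d_train = np.diff(train_losses, axis=0)   # (T-1, num_train)
    d_val = np.diff(val_losses)               # (T-1,)

    # Pearson correlation of each sample's difference trajectory
    # with the validation difference trajectory.
    d_train_c = d_train - d_train.mean(axis=0)
    d_val_c = d_val - d_val.mean()
    num = (d_train_c * d_val_c[:, None]).sum(axis=0)
    den = np.linalg.norm(d_train_c, axis=0) * np.linalg.norm(d_val_c) + 1e-12
    return num / den                           # (num_train,)

def select_coreset(train_losses, val_losses, fraction=0.1):
    """Keep the top-`fraction` of samples by CLD score (illustrative rule)."""
    scores = cld_scores(train_losses, val_losses)
    k = max(1, int(fraction * scores.shape[0]))
    return np.argsort(scores)[::-1][:k]        # indices of selected samples
```

Because the scores depend only on loss values already logged at checkpoints, selection adds essentially no cost beyond standard training bookkeeping.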