Gradient-Guided Furthest Point Sampling for Robust Training Set Selection

Smart training set selection procedures reduce data needs and improve predictive robustness in machine learning problems relevant to chemistry. We introduce Gradient-Guided Furthest Point Sampling (GGFPS), a simple extension of Furthest Point Sampling (FPS) that leverages molecular force norms to guide efficient sampling of the configurational spaces of molecules. We present numerical evidence for a toy system (the Styblinski-Tang function) as well as for molecular dynamics trajectories from the MD17 dataset. Compared to FPS and uniform sampling, GGFPS shows superior data efficiency and robustness in our numerical results. Distribution analysis of the MD17 data suggests that FPS systematically under-samples equilibrium geometries, resulting in large test errors for relaxed structures. GGFPS cures this artifact and (i) enables up to two-fold reductions in training cost without sacrificing predictive accuracy compared to FPS in the 2-dimensional Styblinski-Tang system, (ii) systematically lowers prediction errors for equilibrium as well as strained structures in MD17, and (iii) systematically decreases prediction error variances across all of the MD17 configuration spaces. These results suggest that gradient-aware sampling methods hold great promise as effective training set selection tools, and that naive use of FPS may result in imbalanced training and inconsistent prediction outcomes.
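To make the idea concrete, the following is a minimal pure-Python sketch of greedy FPS on the 2-dimensional Styblinski-Tang function, with an optional per-point weight derived from the gradient norm. The abstract does not state how GGFPS combines distances and force norms, so the weighting used here, (gradient norm + eps) raised to a power `beta`, is an assumption made for illustration and not the authors' formula. The names `weighted_fps`, `gradient_weights`, `beta`, and `eps` are hypothetical.

```python
import math
import random


def styblinski_tang(x):
    """Styblinski-Tang function: f(x) = 0.5 * sum(x_i^4 - 16 x_i^2 + 5 x_i)."""
    return 0.5 * sum(xi**4 - 16.0 * xi**2 + 5.0 * xi for xi in x)


def styblinski_tang_grad(x):
    """Analytic gradient: df/dx_i = 2 x_i^3 - 16 x_i + 2.5."""
    return [2.0 * xi**3 - 16.0 * xi + 2.5 for xi in x]


def grad_norm(x):
    """Euclidean norm of the gradient (the toy analogue of a force norm)."""
    return math.sqrt(sum(g * g for g in styblinski_tang_grad(x)))


def gradient_weights(points, beta, eps=1e-8):
    """ASSUMED weighting, not taken from the source: (|grad| + eps) ** beta.

    beta = 0 gives uniform weights (plain FPS); beta < 0 favours
    low-gradient points, beta > 0 favours high-gradient points.
    """
    return [(grad_norm(p) + eps) ** beta for p in points]


def weighted_fps(points, n_select, weights=None, start=0):
    """Greedy FPS: repeatedly pick the point that maximises
    weight * (distance to the nearest already-selected point)."""
    n = len(points)
    if weights is None:
        weights = [1.0] * n  # plain FPS
    selected = [start]
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < min(n_select, n):
        nxt = max(range(n), key=lambda i: weights[i] * min_dist[i])
        if weights[nxt] * min_dist[nxt] == 0.0:
            break  # only duplicates of selected points remain
        selected.append(nxt)
        # Update each point's distance to its nearest selected point.
        min_dist = [min(d, math.dist(p, points[nxt]))
                    for d, p in zip(min_dist, points)]
    return selected


# Usage: draw a candidate pool, then compare plain and weighted selection.
rng = random.Random(0)
pool = [[rng.uniform(-5.0, 5.0), rng.uniform(-5.0, 5.0)] for _ in range(500)]
plain = weighted_fps(pool, 20)
guided = weighted_fps(pool, 20, gradient_weights(pool, beta=-0.5))
```

With `beta=0` the weights are all 1 and the routine reduces to plain FPS, which gives a convenient baseline. A negative `beta` pulls the selection toward low-gradient regions, the toy analogue of the equilibrium geometries that the abstract reports FPS under-samples. Whether this matches the behaviour of the actual GGFPS scheme would need to be checked against the full paper.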
