Foundational machine learning interatomic potentials (MLIPs) are being developed at a rapid pace, promising ever closer approximation to ab initio accuracy and thereby unlocking simulations at much larger length and time scales. However, benchmarks for these MLIPs are usually limited to ordered, crystalline, bulk materials. Hence, reported performance does not necessarily reflect MLIP performance in real applications such as heterogeneous catalysis. Here, we systematically analyze the zero-shot performance of 80 different MLIPs on tasks typical of heterogeneous catalysis across a range of data sets, including adsorption and reaction on surfaces of alloyed metals, oxides, and metal-oxide interfacial systems. We demonstrate that current-generation foundational MLIPs can already achieve high accuracy in applications such as predicting vacancy formation energies of perovskite oxides or zero-point energies of supported nanoclusters. However, limitations remain. We find that many MLIPs fail catastrophically when applied to magnetic materials, and that relaxing a structure with the MLIP generally increases the energy prediction error compared to a single-point evaluation of a previously optimized structure. Comparing low-cost task-specific models to foundational MLIPs, we highlight core differences between the two approaches and show that, if only accuracy is considered, task-specific models can compete with the best-performing current-generation MLIPs. Furthermore, we show that no single MLIP performs best universally, so users must assess MLIP suitability for their intended application.