Generative models for materials, especially inorganic crystals, have the potential to transform the theoretical prediction of novel compounds and structures. Progress in this field depends on robust benchmarks and minimal, information-rich datasets that enable meaningful model evaluation. This paper critically examines common datasets and reported metrics for the crystal structure prediction task: generating the most likely structures given the chemical composition of a material. We focus on three key issues. First, materials datasets should contain unique crystal structures; for example, we show that the widely used carbon-24 dataset contains only $\approx$40% unique structures. Second, materials datasets should not be split randomly when polymorphs are numerous across many different compositions, which we find to be the case for the perov-5 and MP-20 datasets. Third, benchmarks can mislead if used uncritically, e.g., when a match rate metric is reported without considering the structural variety exhibited by identical building blocks. To address these often-overlooked issues, we introduce several fixes. We provide revised versions of the carbon-24 dataset: one with duplicates removed, one deduplicated and split by number of atoms $N$, one with enantiomorphs, and two containing only identical structures but with different unit cells. We also propose new splits for datasets with polymorphs, ensuring that all polymorphs of a composition fall within the same split subset, which sets a more sensible standard for benchmarking model performance. Finally, we present METRe and cRMSE, new model evaluation metrics that correct existing issues with the match rate metric.
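The idea of keeping polymorphs together in one split subset can be illustrated with a minimal sketch. It assumes that polymorphs are identified by sharing a reduced composition and that each dataset entry is a dictionary with a `composition` field mapping element symbols to integer counts. The function names, the entry layout, and the split fractions are hypothetical choices for illustration, not the procedure used in the paper.

```python
import random
from collections import defaultdict
from functools import reduce
from math import gcd


def reduced_formula_key(composition):
    """Return a hashable key for a composition reduced by the GCD of its counts."""
    divisor = reduce(gcd, composition.values())
    return tuple(sorted((el, n // divisor) for el, n in composition.items()))


def polymorph_grouped_split(entries, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split entries into train/val/test so each composition lands in one subset.

    Assumption: entries with the same reduced composition are polymorphs.
    """
    # Group all entries that share a reduced composition.
    groups = defaultdict(list)
    for entry in entries:
        groups[reduced_formula_key(entry["composition"])].append(entry)

    # Sort before shuffling so the result is reproducible for a given seed.
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    targets = [f * len(entries) for f in fractions]
    splits = ([], [], [])
    idx = 0
    for key in keys:
        # Move to the next subset once the current one has reached its target size.
        while idx < len(splits) - 1 and len(splits[idx]) >= targets[idx]:
            idx += 1
        # A whole group is always assigned to a single subset.
        splits[idx].extend(groups[key])
    return splits
```

Calling `polymorph_grouped_split(entries)` returns three lists. Because whole groups are assigned at once, subset sizes only approximate the requested fractions, but no composition appears in more than one subset.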