Adsorption energy is a key descriptor of catalytic reactivity. It is defined as the difference between the relaxed total energy of the adsorbate-surface system and that of an appropriate reference state; the accuracy of relaxed-energy prediction therefore directly determines the reliability of machine-learning-driven catalyst screening. E(3)-equivariant graph neural networks (GNNs) operate natively on three-dimensional atomic coordinates under periodic boundary conditions and have demonstrated strong performance on such tasks. Language-model-based approaches, in contrast, offer human-readable textual descriptions and reduce the reliance on explicit graph construction, which broadens their applicability; however, they remain insufficient both in the accuracy of adsorption-configuration energy prediction and in distinguishing ``the same system with different configurations,'' even with graph-assisted pretraining in the style of GAP-CATBERTa. To this end, we propose QE-Catalytic, a multimodal framework that tightly couples a large language model (\textbf{Q}wen) with an E(3)-equivariant graph Transformer (\textbf{E}quiformer-V2), providing unified support for adsorption-configuration property prediction and inverse design on complex catalytic surfaces. During prediction, QE-Catalytic jointly exploits three-dimensional structures and structured configuration text, and injects ``3D geometric information'' into the language channel via graph-text alignment, so that it can serve as a high-accuracy text-based predictor when precise coordinates are unavailable; it can also autoregressively generate CIF files for target-energy-driven structure design and information completion. On OC20, QE-Catalytic reduces the MAE of relaxed adsorption energy from 0.713~eV to 0.486~eV and consistently outperforms baselines such as CatBERTa and GAP-CATBERTa across multiple evaluation protocols.
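For concreteness, a minimal sketch of the definition invoked above, assuming the common convention (used, e.g., in OC20) in which the reference state decomposes into the clean relaxed slab plus a gas-phase reference for the adsorbate:
\begin{equation}
E_{\mathrm{ads}} = E_{\mathrm{sys}} - E_{\mathrm{slab}} - E_{\mathrm{gas}},
\end{equation}
where $E_{\mathrm{sys}}$ denotes the relaxed total energy of the adsorbate-surface system, $E_{\mathrm{slab}}$ that of the clean surface, and $E_{\mathrm{gas}}$ the energy of the gas-phase reference species; the exact reference-state choice is an assumption here and should follow the dataset's convention.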