Predicting reaction outcomes across continuous solvent composition ranges remains a critical challenge in organic synthesis and process chemistry. Traditional machine learning approaches often treat solvent identity as a discrete categorical variable, which prevents systematic interpolation and extrapolation across the solvent space. This work introduces the \textbf{Catechol Benchmark}, a high-throughput transient flow chemistry dataset comprising 1,227 experimental yield measurements for the rearrangement of allyl-substituted catechol in 24 pure solvents and their binary mixtures, parameterized by continuous volume fractions ($\% B$). We evaluate several model architectures under rigorous leave-one-solvent-out and leave-one-mixture-out protocols to test generalization to unseen chemical environments. Our results demonstrate that classical tabular methods (e.g., gradient-boosted decision trees) and large language model embeddings (e.g., Qwen-7B) struggle with quantitative precision, yielding mean squared errors (MSE) of 0.099 and 0.129, respectively. In contrast, we propose a hybrid GNN-based architecture that integrates Graph Attention Networks (GATs) with Differential Reaction Fingerprints (DRFP) and learned mixture-aware solvent encodings. This approach achieves an \textbf{MSE of 0.0039} ($\pm 0.0003$), representing a 60\% error reduction over competitive baselines and a $>25\times$ lower MSE than tabular ensembles. Ablation studies confirm that explicit molecular graph message-passing and continuous mixture encoding are essential for robust generalization. The complete dataset, evaluation protocols, and reference implementations are released to facilitate data-efficient reaction prediction and continuous solvent representation learning.
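As a quick arithmetic check on the reported figures, the improvement factors can be recomputed from the MSE values quoted above. This assumes the $>25\times$ factor is the ratio of a baseline MSE to the hybrid model's MSE; the subscripts are labels introduced here for illustration only.

\[
\frac{\mathrm{MSE}_{\mathrm{GBDT}}}{\mathrm{MSE}_{\mathrm{hybrid}}} = \frac{0.099}{0.0039} \approx 25.4,
\qquad
\frac{\mathrm{MSE}_{\mathrm{LLM}}}{\mathrm{MSE}_{\mathrm{hybrid}}} = \frac{0.129}{0.0039} \approx 33.1 .
\]

Under that reading, the tabular ratio is consistent with the stated $>25\times$ figure. The 60\% reduction refers to a different, stronger baseline whose MSE is not quoted here.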