The discovery of novel odorant molecules is key to the fragrance and flavor industries, yet efficiently navigating the vast chemical space to identify structures with desirable olfactory properties remains a significant challenge. Generative artificial intelligence offers a promising approach to \textit{de novo} molecular design but typically requires large sets of molecules to learn from. To address this problem, we present a framework that combines a variational autoencoder (VAE) with a quantitative structure-activity relationship (QSAR) model to generate novel odorants from limited training sets of odor molecules. The self-supervised learning capabilities of the VAE allow it to learn SMILES grammar from the ChEMBL database, while its training objective is augmented with a loss term derived from an external QSAR model to structure the latent representation according to odor probability. The VAE shows high internal consistency in learning the QSAR supervision signal. Validation against an external, unseen ground-truth dataset (Unique Good Scents) further confirms that the model generates syntactically valid structures (100\% validity, achieved via rejection sampling) with 94.8\% uniqueness. The latent space is effectively structured by odor likelihood, as evidenced by a Fréchet ChemNet Distance (FCD) of $\approx 6.96$ between generated molecules and known odorants, compared with $\approx 21.6$ for the ChEMBL baseline. Structural analysis via Bemis-Murcko scaffolds reveals that 74.4\% of candidates possess novel core frameworks distinct from the training data, indicating that the model explores chemical space extensively rather than simply derivatizing known odorants. Generated candidates display physicochemical properties ....
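As a rough illustration of a training objective augmented with a QSAR-derived loss term, the sketch below adds a supervision penalty to the usual VAE reconstruction and KL terms. The abstract does not state the exact form of the QSAR term, so everything specific here is an assumption: the squared-error penalty between an odor probability predicted from the latent code and the probability assigned by the external QSAR model, the weights $\beta$ and $\lambda$, and all function and parameter names are hypothetical.

```python
import math


def kl_to_standard_normal(mu, logvar):
    """KL divergence KL(N(mu, diag(exp(logvar))) || N(0, I)) for one latent vector."""
    return -0.5 * sum(
        1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar)
    )


def qsar_augmented_loss(recon_nll, mu, logvar, pred_odor_prob, qsar_odor_prob,
                        beta=1.0, lam=1.0):
    """Per-molecule loss: reconstruction + beta * KL + lam * QSAR supervision.

    recon_nll      -- negative log-likelihood of the reconstructed SMILES string
    mu, logvar     -- encoder outputs describing the approximate posterior
    pred_odor_prob -- odor probability predicted from the latent code (assumed head)
    qsar_odor_prob -- odor probability from the external QSAR model
    beta, lam      -- hypothetical weights; the source does not give values
    """
    kl = kl_to_standard_normal(mu, logvar)
    # Assumption: squared error as the supervision penalty. A binary
    # cross-entropy term would be an equally plausible choice.
    qsar_term = (pred_odor_prob - qsar_odor_prob) ** 2
    return recon_nll + beta * kl + lam * qsar_term


if __name__ == "__main__":
    loss = qsar_augmented_loss(
        recon_nll=12.5,
        mu=[0.3, -0.1],
        logvar=[-0.2, 0.1],
        pred_odor_prob=0.7,
        qsar_odor_prob=0.9,
    )
    print(f"total loss: {loss:.4f}")
```

Under this assumed form, setting $\lambda = 0$ recovers a plain $\beta$-VAE objective, so the supervision term is the only part that ties latent coordinates to odor probability.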