MS-BART: Unified Modeling of Mass Spectra and Molecules for Structure Elucidation

Mass spectrometry (MS) plays a critical role in molecular identification and has significantly advanced scientific discovery. However, structure elucidation from MS data remains challenging because annotated spectra are scarce. Large-scale pretraining has proven effective against data scarcity in other domains, but applying it to mass spectrometry is hindered by the complexity and heterogeneity of raw spectral signals. To address this, we propose MS-BART, a unified modeling framework that maps mass spectra and molecular structures into a shared token vocabulary, enabling cross-modal learning through large-scale pretraining on reliably computed fingerprint-molecule datasets. Multi-task pretraining objectives further enhance MS-BART's generalization by jointly optimizing denoising and translation tasks. The pretrained model is then transferred to experimental spectra by finetuning on fingerprint predictions generated with MIST, a pretrained spectral inference model, which improves robustness to real-world spectral variability. Although finetuning reduces the distributional difference, MS-BART still suffers from molecular hallucination and requires further alignment. We therefore introduce a chemical feedback mechanism that guides the model toward generating molecules closer to the reference structure. Extensive evaluations show that MS-BART achieves state-of-the-art performance on 5 of 12 key metrics on MassSpecGym and NPLIB1 and is an order of magnitude faster than competing diffusion-based methods, while comprehensive ablation studies systematically validate the model's effectiveness and robustness.
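The abstract does not specify how fingerprints are tokenized or how the chemical feedback is scored, so the following is only a minimal sketch under stated assumptions. It assumes that a fingerprint is represented by the indices of its active bits, that each active bit becomes one token of a hypothetical form such as `<fp0042>`, and that closeness to the reference structure is measured with Tanimoto similarity between fingerprints. The function names `fingerprint_to_tokens`, `tanimoto`, and `rank_candidates` are illustrative and do not come from the paper.

```python
from typing import Dict, Iterable, List, Set


def fingerprint_to_tokens(bits: Iterable[int]) -> List[str]:
    """Turn the indices of active fingerprint bits into sorted string tokens.

    The "<fpNNNN>" token format is an assumption made for illustration.
    """
    return [f"<fp{i:04d}>" for i in sorted(set(bits))]


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of active bits."""
    if not a and not b:
        return 1.0  # two empty fingerprints are treated as identical
    return len(a & b) / len(a | b)


def rank_candidates(reference: Set[int], candidates: Dict[str, Set[int]]) -> List[str]:
    """Order candidate names by decreasing similarity to the reference.

    A ranking like this could serve as a feedback signal; whether MS-BART
    uses this exact signal is an assumption.
    """
    return sorted(
        candidates,
        key=lambda name: tanimoto(reference, candidates[name]),
        reverse=True,
    )


# Toy data: active-bit indices of a reference and three generated candidates.
reference_bits = {3, 17, 42, 128}
candidates = {
    "candidate_a": {3, 17, 42, 128},  # same bits as the reference
    "candidate_b": {3, 17, 99},       # shares 2 of 5 distinct bits
    "candidate_c": {5, 6},            # shares no bits
}

tokens = fingerprint_to_tokens(reference_bits)
ranking = rank_candidates(reference_bits, candidates)
print(tokens)
print(ranking)
```

In this toy setting the token sequence stands in for the spectrum side of the shared vocabulary, and the ranking prefers candidates whose fingerprints overlap more with the reference.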

