Mass spectrometry (MS) plays a critical role in molecular identification and has significantly advanced scientific discovery. However, structure elucidation from MS data remains challenging because annotated spectra are scarce. While large-scale pretraining has proven effective at addressing data scarcity in other domains, applying this paradigm to mass spectrometry is hindered by the complexity and heterogeneity of raw spectral signals. To address this, we propose MS-BART, a unified modeling framework that maps mass spectra and molecular structures into a shared token vocabulary, enabling cross-modal learning through large-scale pretraining on reliably computed fingerprint-molecule datasets. Multi-task pretraining objectives further enhance MS-BART's generalization by jointly optimizing denoising and translation tasks. The pretrained model is subsequently transferred to experimental spectra through finetuning on fingerprint predictions generated with MIST, a pretrained spectral inference model, thereby enhancing robustness to real-world spectral variability. While finetuning alleviates the distributional difference, MS-BART still suffers from molecular hallucination and requires further alignment. We therefore introduce a chemical feedback mechanism that guides the model toward generating molecules closer to the reference structure. Extensive evaluations demonstrate that MS-BART achieves state-of-the-art performance on 5 of 12 key metrics across MassSpecGym and NPLIB1 and is an order of magnitude faster than competing diffusion-based methods, while comprehensive ablation studies systematically validate the model's effectiveness and robustness.
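As a rough illustration of two ideas named above, the following sketch shows how a binary fingerprint might be serialized into tokens and how generated candidates might be scored against a reference structure. It is a minimal sketch under stated assumptions, not the MS-BART implementation: the token format `<fp0001>`, the use of Tanimoto similarity as the feedback signal, and the function names `fingerprint_to_tokens`, `tanimoto`, and `rank_candidates` are all hypothetical and chosen only for illustration.

```python
from typing import List, Set


def fingerprint_to_tokens(bits: List[int]) -> List[str]:
    """Emit one token per active bit of a binary fingerprint.

    Assumption: the "<fpNNNN>" token format is hypothetical; the abstract
    only states that spectra and molecules share a token vocabulary.
    """
    return [f"<fp{i:04d}>" for i, bit in enumerate(bits) if bit]


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of active bit indices."""
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


def rank_candidates(reference: Set[int], candidates: List[Set[int]]) -> List[Set[int]]:
    """Order candidate fingerprints from most to least similar to the reference.

    Assumption: a similarity ranking like this is one plausible form of
    feedback signal; the abstract does not specify the actual mechanism.
    """
    return sorted(candidates, key=lambda c: tanimoto(reference, c), reverse=True)


if __name__ == "__main__":
    print(fingerprint_to_tokens([0, 1, 0, 1]))
    reference = {1, 2, 3}
    candidates = [{9}, {1, 2}, {1, 2, 3}]
    for cand in rank_candidates(reference, candidates):
        print(sorted(cand), round(tanimoto(reference, cand), 3))
```

In practice the fingerprints would be computed by a cheminformatics toolkit and the candidates would be molecules decoded by the model; plain sets of bit indices are used here only to keep the sketch self-contained.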