Therapeutic peptides have emerged as a pivotal modality in modern drug discovery, occupying a chemically and topologically rich space. While accurate prediction of their physicochemical properties is essential for accelerating peptide development, existing molecular language models rely on representations that fail to capture this complexity. Atom-level SMILES notation generates long token sequences and obscures cyclic topology, whereas amino-acid-level representations cannot encode the diverse chemical modifications central to modern peptide design. To bridge this representational gap, the Hierarchical Editing Language for Macromolecules (HELM) offers a unified framework enabling precise description of both monomer composition and connectivity, making it a promising foundation for peptide language modeling. Here, we propose HELM-BERT, the first encoder-based peptide language model trained on HELM notation. Based on DeBERTa, HELM-BERT is specifically designed to capture hierarchical dependencies within HELM sequences. The model is pre-trained on a curated corpus of 39,079 chemically diverse peptides spanning linear and cyclic structures. HELM-BERT significantly outperforms state-of-the-art SMILES-based language models in downstream tasks, including cyclic peptide membrane permeability prediction and peptide-protein interaction prediction. These results demonstrate that HELM's explicit monomer- and topology-aware representations offer substantial data-efficiency advantages for modeling therapeutic peptides, bridging a long-standing gap between small-molecule and protein language models.