UniSite: The First Cross-Structure Dataset and Learning Framework for End-to-End Ligand Binding Site Detection

The detection of ligand binding sites for proteins is a fundamental step in Structure-Based Drug Design. Despite notable advances in recent years, existing methods, datasets, and evaluation metrics face several key challenges: (1) current datasets and methods are centered on individual protein-ligand complexes and neglect that diverse binding sites may exist across multiple complexes of the same protein, introducing significant statistical bias; (2) ligand binding site detection is typically modeled as a discontinuous workflow, employing binary segmentation followed by clustering algorithms; (3) traditional evaluation metrics do not adequately reflect the actual performance of different binding site prediction methods. To address these issues, we first introduce UniSite-DS, the first UniProt (Unique Protein)-centric ligand binding site dataset, which contains 4.81 times more multi-site data and 2.08 times more overall data than the previously most widely used datasets. We then propose UniSite, the first end-to-end ligand binding site detection framework supervised by a set prediction loss with bijective matching. In addition, we introduce Average Precision based on Intersection over Union (IoU) as a more accurate evaluation metric for ligand binding site prediction. Extensive experiments on UniSite-DS and several representative benchmark datasets demonstrate that IoU-based Average Precision provides a more accurate reflection of prediction quality, and that UniSite outperforms current state-of-the-art methods in ligand binding site detection. The dataset and code will be made publicly available at https://github.com/quanlin-wu/unisite.
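To make the IoU-based Average Precision idea concrete, here is a minimal sketch for a single protein. It is not the authors' implementation: the abstract does not say how sites are represented, which IoU threshold is used, or how scores are aggregated across proteins. The sketch assumes each site is a set of residue indices, each prediction carries a confidence score, a prediction counts as correct when its IoU with a still-unmatched ground-truth site reaches a hypothetical threshold of 0.5, and AP is the non-interpolated form. The names `iou` and `average_precision` are illustrative.

```python
from typing import List, Set, Tuple


def iou(a: Set[int], b: Set[int]) -> float:
    """Intersection over Union of two residue-index sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def average_precision(
    predictions: List[Tuple[float, Set[int]]],
    ground_truth: List[Set[int]],
    iou_threshold: float = 0.5,  # assumed threshold, not taken from the paper
) -> float:
    """Non-interpolated AP for one protein.

    predictions: (confidence, residue set) pairs, in any order.
    ground_truth: one residue set per annotated binding site.
    """
    if not ground_truth:
        return 0.0

    # Visit predictions from most to least confident.
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)

    matched: Set[int] = set()  # ground-truth sites already claimed
    hits: List[bool] = []
    for _, site in ranked:
        # Find the unmatched ground-truth site with the highest IoU.
        best_j, best_iou = -1, 0.0
        for j, gt in enumerate(ground_truth):
            if j in matched:
                continue
            value = iou(site, gt)
            if value > best_iou:
                best_j, best_iou = j, value
        if best_j >= 0 and best_iou >= iou_threshold:
            matched.add(best_j)  # each ground-truth site is used at most once
            hits.append(True)
        else:
            hits.append(False)

    # Sum precision at the rank of every true positive, then divide by the
    # number of ground-truth sites so that missed sites lower the score.
    true_positives = 0
    precision_sum = 0.0
    for rank, hit in enumerate(hits, start=1):
        if hit:
            true_positives += 1
            precision_sum += true_positives / rank
    return precision_sum / len(ground_truth)


# Toy example with two annotated sites and three ranked predictions.
ground_truth = [{1, 2, 3, 4}, {10, 11, 12}]
predictions = [
    (0.9, {1, 2, 3}),         # IoU 0.75 with the first site: hit
    (0.8, {20, 21}),          # overlaps nothing: miss
    (0.6, {10, 11, 12, 13}),  # IoU 0.75 with the second site: hit
]
ap = average_precision(predictions, ground_truth)
print(round(ap, 4))
```

Because each ground-truth site can be claimed only once, a method that reports the same pocket several times is penalized, and a confident false positive ranked above a correct site lowers the score. A metric that only checks whether some prediction lands near the ligand would miss both effects.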

