Graph classification has applications in bioinformatics, the social sciences, automated fake news detection, web document classification, and more. In many practical scenarios, including web-scale applications where labels are scarce or expensive to obtain, unsupervised learning is a natural paradigm, but it typically trades off performance. Recently, contrastive learning (CL) has enabled unsupervised computer vision models to compete well with their supervised counterparts. Theoretical and empirical works analyzing visual CL frameworks find that leveraging large datasets and domain-aware augmentations is essential for framework success. Interestingly, graph CL frameworks often report high performance while using orders of magnitude less data and employing domain-agnostic augmentations (e.g., node or edge dropping, feature perturbations) that can corrupt the graphs' underlying properties. Motivated by these discrepancies, we seek to determine: (i) why existing graph CL frameworks perform well despite weak augmentations and limited data; and (ii) whether adhering to visual CL principles can improve performance on graph classification tasks. Through extensive analysis, we identify flawed practices in graph data augmentation and evaluation protocols that are commonly used in the graph CL literature, and propose improved practices and sanity checks for future research and applications. We show that on small benchmark datasets, the inductive bias of graph neural networks can significantly compensate for the limitations of existing frameworks. In case studies with relatively larger graph classification tasks, we find that commonly used domain-agnostic augmentations perform poorly, while adhering to principles from visual CL can significantly improve performance. For example, in graph-based document classification, which can be used to improve web search, we show that task-relevant augmentations improve accuracy by 20%.
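To make concrete what "domain-agnostic augmentations" refers to, below is a minimal sketch of the node-dropping, edge-dropping, and feature-perturbation operations mentioned in the abstract, written in plain Python over an edge-list graph representation. The function names and signatures are illustrative assumptions, not the API of any particular graph CL framework; real frameworks typically operate on tensor-based graph batches.

```python
import random

def drop_nodes(nodes, edges, drop_ratio=0.2, seed=0):
    """Node dropping: randomly remove a fraction of nodes and their incident edges."""
    rng = random.Random(seed)
    keep = {n for n in nodes if rng.random() >= drop_ratio}
    kept_edges = [(u, v) for (u, v) in edges if u in keep and v in keep]
    return sorted(keep), kept_edges

def drop_edges(edges, drop_ratio=0.2, seed=0):
    """Edge dropping: randomly remove a fraction of edges, keeping all nodes."""
    rng = random.Random(seed)
    return [e for e in edges if rng.random() >= drop_ratio]

def perturb_features(features, noise_scale=0.1, seed=0):
    """Feature perturbation: add bounded uniform noise to each node feature vector."""
    rng = random.Random(seed)
    return [[x + rng.uniform(-noise_scale, noise_scale) for x in row]
            for row in features]
```

Note that none of these operations consults the graph's semantics: dropping an edge can disconnect the graph or break a functional motif (e.g., a chemical bond), which is exactly the kind of corruption of underlying properties the abstract cautions against.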