On Using Classification Datasets to Evaluate Graph Outlier Detection: Peculiar Observations and New Insights

It is common practice in the outlier mining community to repurpose classification datasets for evaluating detection models. Typically a binary classification dataset is used: samples from one of the classes (usually the larger one) are designated as inliers, and the other class is substantially down-sampled to create the (ground-truth) outliers. In this study, we identify an intriguing issue with repurposing graph classification datasets for graph outlier detection in this manner. Surprisingly, the detection performance of outlier models depends significantly on which class is down-sampled; put differently, accuracy often flips from high to low depending on which of the classes is down-sampled to represent the outliers. The problem is particularly pronounced for a certain family of propagation-based outlier detection models. Through careful analysis, we show that this issue stems mainly from disparate within-class sample similarity, which is amplified by various propagation-based models. This disparity affects key characteristics of the inlier and outlier distributions and, indirectly, the difficulty of the outlier detection task and hence the performance outcomes. With this study, we aim to draw attention to this (to our knowledge) previously unnoticed issue, as it has implications for the fair and effective evaluation of detection models, and we hope that it will motivate the design of better evaluation benchmarks for outlier detection. Finally, we discuss the possibly overarching implications of using propagation-based models on datasets with disparate within-class sample similarity beyond outlier detection, specifically for graph classification and graph-level clustering tasks.
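The repurposing protocol described above can be sketched in a few lines. This is only an illustrative sketch, not the authors' code: the helper name `make_outlier_task`, the `keep_fraction` parameter, its 10% default, and the toy label list are all assumptions made for the example. It builds both variants of the task from the same binary-labeled dataset, one per choice of down-sampled class, which is the comparison the abstract argues should be reported.

```python
import random


def make_outlier_task(labels, outlier_class, keep_fraction=0.1, seed=0):
    """Build one outlier detection task from binary class labels.

    Hypothetical helper: every sample whose label differs from
    `outlier_class` is kept as an inlier, and a random `keep_fraction`
    of the `outlier_class` samples is kept as ground-truth outliers.
    Returns (indices, is_outlier), aligned element by element.
    """
    rng = random.Random(seed)
    inlier_idx = [i for i, y in enumerate(labels) if y != outlier_class]
    outlier_pool = [i for i, y in enumerate(labels) if y == outlier_class]
    # Keep at least one outlier, but never more than the pool holds.
    n_keep = min(len(outlier_pool), max(1, round(keep_fraction * len(outlier_pool))))
    outlier_idx = sorted(rng.sample(outlier_pool, n_keep))
    indices = inlier_idx + outlier_idx
    is_outlier = [0] * len(inlier_idx) + [1] * len(outlier_idx)
    return indices, is_outlier


# Toy labels (assumed data): 60 graphs of class 0 and 40 graphs of class 1.
labels = [0] * 60 + [1] * 40

# Variant A: class 1 is down-sampled to play the outliers.
idx_a, out_a = make_outlier_task(labels, outlier_class=1)
# Variant B: class 0 is down-sampled instead.
idx_b, out_b = make_outlier_task(labels, outlier_class=0)

print("variant A:", len(idx_a), "samples,", sum(out_a), "outliers")
print("variant B:", len(idx_b), "samples,", sum(out_b), "outliers")
```

Running a detector on both variants and reporting the two scores side by side would expose the kind of flip the abstract describes; evaluating only one variant would hide it.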
