Machine learning has opened up new tools for financial fraud detection. Given a sample of labelled transactions, a machine learning classification algorithm learns to detect fraud. With credit card transaction volumes growing and fraud rates rising, there is increasing interest in finding appropriate machine learning classifiers for detection. However, fraud data sets are diverse and exhibit inconsistent characteristics, so a model that is effective on one data set is not guaranteed to perform well on another. Further, data patterns and characteristics are likely to drift over time. Additionally, fraud data exhibits massive and varying class imbalance. In this work, we evaluate sampling methods as a viable pre-processing mechanism to handle imbalance and propose a data-driven classifier selection strategy for the highly imbalanced data sets characteristic of fraud detection. The model derived from our selection strategy surpasses peer models whilst working under more realistic conditions, establishing the effectiveness of the strategy.
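To make the idea of sampling as a pre-processing step concrete, the following is a minimal sketch of one common sampling method, random undersampling of the majority class. It is an illustration only and not necessarily one of the methods evaluated in this work; the function name `random_undersample`, the `ratio` parameter, and the toy data are all assumptions introduced here.

```python
import random
from collections import Counter


def random_undersample(rows, labels, ratio=1.0, seed=0):
    """Randomly drop majority-class rows until the majority count is at
    most ratio * minority count. Assumes binary labels (hypothetical helper)."""
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    # most_common orders classes from most to least frequent.
    (majority, n_major), (minority, n_minor) = counts.most_common(2)
    keep_major = min(n_major, int(ratio * n_minor))

    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    major_idx = [i for i, y in enumerate(labels) if y == majority]
    minor_idx = [i for i, y in enumerate(labels) if y == minority]

    # Keep every minority row and a random subset of majority rows,
    # preserving the original row order.
    kept = sorted(rng.sample(major_idx, keep_major) + minor_idx)
    return [rows[i] for i in kept], [labels[i] for i in kept]


# Toy data (made up): 995 legitimate (0) and 5 fraudulent (1) transactions.
labels = [0] * 995 + [1] * 5
rows = [[float(i)] for i in range(1000)]

X_bal, y_bal = random_undersample(rows, labels, ratio=1.0)
print(Counter(y_bal))
```

In practice, resampling of this kind is usually applied to the training split only, so that evaluation still reflects the original class distribution.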