In machine learning, supervised classifiers infer prediction functions from labeled data and use them to predict labels for unlabeled data. Supervised classifiers are widely applied in domains such as computational biology, computational physics, and healthcare to make critical decisions. However, supervised classifiers are often hard to test because the expected answers are unknown. This is commonly known as the \emph{oracle problem}, and metamorphic testing (MT) has been used to test such programs. In MT, metamorphic relations (MRs) are developed from intrinsic characteristics of the software under test (SUT). These MRs are used to generate test data and to verify the correctness of the test results without a test oracle. The effectiveness of MT depends heavily on the MRs used for testing. In this paper, we conduct an extensive empirical study to evaluate the fault detection effectiveness of MRs that have been used in multiple previous studies to test supervised classifiers. Our study uses a total of 709 reachable mutants generated by multiple mutation engines and uses data sets with varying characteristics to test the SUT. Our results reveal that only 14.8\% of these mutants are detected using the MRs, and that the fault detection effectiveness of these MRs does not scale with the increased number of mutants when compared to what was reported in previous studies.
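To make the idea of an MR concrete, the following is a minimal sketch, not the setup used in the study. It assumes a toy nearest-neighbor classifier and one illustrative relation: permuting the attributes consistently in the training and test data should leave the predictions unchanged. The names `knn_predict`, `mr_attribute_permutation_holds`, and the `skip_last` flag (a hand-seeded fault standing in for a mutant) are all hypothetical.

```python
from collections import Counter


def knn_predict(train_X, train_y, x, k=1, skip_last=False):
    """Classify x by majority vote among its k nearest training points.

    skip_last=True simulates a seeded fault (a hypothetical mutant) that
    ignores the final attribute when computing distances.
    """
    n = len(x) - 1 if skip_last else len(x)
    dists = []
    for i, row in enumerate(train_X):
        # Squared Euclidean distance; integer data keeps it exact.
        d = sum((row[j] - x[j]) ** 2 for j in range(n))
        dists.append((d, i))
    dists.sort()  # distance ties are broken by training index
    votes = Counter(train_y[i] for _, i in dists[:k])
    # Vote ties are broken deterministically by label.
    return min(votes, key=lambda label: (-votes[label], label))


def permute(row, order):
    """Reorder the attributes of one sample."""
    return [row[j] for j in order]


def mr_attribute_permutation_holds(train_X, train_y, tests, order, **kw):
    """Check the MR: source and follow-up predictions must be identical."""
    source = [knn_predict(train_X, train_y, x, **kw) for x in tests]
    follow_train = [permute(r, order) for r in train_X]
    follow = [
        knn_predict(follow_train, train_y, permute(x, order), **kw)
        for x in tests
    ]
    return source == follow


# Illustrative data only; these values are not from the study.
train_X = [[0, 0, 0], [1, 1, 0], [0, 9, 1], [1, 8, 1]]
train_y = ["a", "a", "b", "b"]
tests = [[0, 8, 0], [1, 0, 1]]
order = [2, 0, 1]

correct_ok = mr_attribute_permutation_holds(train_X, train_y, tests, order)
mutant_ok = mr_attribute_permutation_holds(
    train_X, train_y, tests, order, skip_last=True
)
print(correct_ok, mutant_ok)
```

No expected labels are needed here: the check compares only the source and follow-up outputs. On this data the relation holds for the unmodified classifier and is violated by the seeded fault, which is how an MR reveals a mutant without a test oracle.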