We design an efficient algorithm that outputs tests for identifying predominantly homogeneous subcohorts of patients in large inhomogeneous datasets. Our theoretical contribution is a rounding technique, similar to that of Goemans and Williamson (1995), that approximates the optimal solution within a factor of $0.82$. As an application, we use our algorithm to trade off sensitivity against specificity and systematically identify clinically interesting homogeneous subcohorts of patients in the RNA microarray dataset for breast cancer from Curtis et al. (2012). One such subcohort suggests a link between LXR over-expression and BRCA2 and MSH6 methylation levels among its patients.