Methods for global measurement of transcript abundance, such as microarrays and RNA-seq, generate datasets in which the number of measured features far exceeds the number of observations. Extracting biologically meaningful and experimentally tractable insights from such data therefore requires high-dimensional prediction. Existing sparse linear approaches to this challenge have been highly successful, but some important issues remain: these methods can fail to select the correct features, predict poorly relative to non-sparse alternatives, or ignore unknown grouping structure among the features. We propose a method called SuffPCR that yields improved predictions in high-dimensional tasks, including regression and classification, especially in the typical omics setting of correlated features. SuffPCR first estimates sparse principal components and then fits a linear model on the recovered subspace. Because the estimated subspace is sparse in the features, the resulting predictions depend on only a small subset of genes. SuffPCR works well on a variety of simulated and experimental transcriptomic data, performing nearly optimally when the model assumptions are satisfied. We also establish near-optimal theoretical guarantees.
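The two-stage idea described above (sparse principal components first, then a linear model on the recovered subspace) can be illustrated with a minimal NumPy sketch. This is not the SuffPCR estimator: as a stand-in for the sparse PCA step it simply hard-thresholds ordinary PCA loadings, keeping the rows with the largest norm. The function names `sparse_pcr_fit` and `sparse_pcr_predict` and the parameters `n_components` and `n_keep` are hypothetical, chosen only for this illustration.

```python
import numpy as np


def sparse_pcr_fit(X, y, n_components=1, n_keep=10):
    """Illustrative sparse principal component regression.

    ASSUMPTION: hard-thresholded PCA loadings stand in for a real
    sparse PCA estimator; this is not the SuffPCR algorithm itself.
    """
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean

    # Step 1a: ordinary PCA loadings from the SVD of the centered data.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_components].T  # shape (n_features, n_components)

    # Step 1b: keep only the n_keep features with the largest loading
    # norm and zero out every other row, so the subspace is row-sparse.
    row_norms = np.linalg.norm(V, axis=1)
    keep = np.argsort(row_norms)[-n_keep:]
    V_sparse = np.zeros_like(V)
    V_sparse[keep] = V[keep]

    # Step 2: least-squares regression of y on the sparse component scores.
    Z = Xc @ V_sparse
    gamma, *_ = np.linalg.lstsq(Z, y - y_mean, rcond=None)

    # Map back to feature space: rows zeroed above give zero coefficients.
    beta = V_sparse @ gamma
    intercept = y_mean - x_mean @ beta
    return beta, intercept


def sparse_pcr_predict(X, beta, intercept):
    """Predict responses from the fitted feature-space coefficients."""
    return X @ beta + intercept


# Simulated example: 50 observations, 200 features, where only the first
# 10 features share a latent factor that also drives the response.
rng = np.random.default_rng(0)
n, p = 50, 200
u = rng.normal(size=n)
X = 0.5 * rng.normal(size=(n, p))
X[:, :10] += 3.0 * u[:, None]
y = 2.0 * u + 0.1 * rng.normal(size=n)

beta, intercept = sparse_pcr_fit(X, y, n_components=1, n_keep=10)
y_hat = sparse_pcr_predict(X, beta, intercept)
print("nonzero coefficients:", np.count_nonzero(beta))
```

Because the coefficient vector is the product of the row-sparse loading matrix and the regression weights, any feature whose loadings were zeroed contributes nothing to the prediction, which is the sense in which predictions depend on only a small subset of genes.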