In applied multivariate statistics, estimating the number of latent dimensions or the number of clusters, $k$, is a fundamental and recurring problem. We study a sequence of statistics called "cross-validated eigenvalues." Under a large class of random graph models, including those with Poisson or Bernoulli edges, and without parametric assumptions, we provide a $p$-value for each cross-validated eigenvalue. Each $p$-value tests the null hypothesis that the corresponding sample eigenvector is orthogonal to (i.e., uncorrelated with) the true latent dimensions. This approach naturally adapts to problems where some dimensions are not statistically detectable. In scenarios where all $k$ dimensions can be estimated, we show that our procedure consistently estimates $k$. In simulations and a data example, the proposed estimator compares favorably to alternative approaches in both computational and statistical performance.
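As a rough illustration of the idea, the following Python sketch computes a cross-validated eigenvalue and a one-sided $p$-value for each leading sample eigenvector of a graph. It is a minimal sketch under stated assumptions, not the procedure analyzed in the paper. The abstract does not specify the data split, the variance estimate, or the stopping rule, so the following are all assumptions made for illustration: the train/test split by binomial thinning of edges, the plug-in variance $2\sum_{i,j} x_i^2 x_j^2 [A_{\text{test}}]_{ij}$ (which treats test edge counts as roughly Poisson), the normal approximation, the sequential stopping rule, and the names `split_edges`, `cross_validated_eigenvalues`, `estimate_k`, `eps`, and `alpha`.

```python
import math
import numpy as np


def split_edges(A, eps, rng):
    """Assumed splitting scheme: send each edge (or each unit of an
    integer edge count) to the test graph independently with probability eps."""
    upper = np.triu(A, k=1).astype(np.int64)      # use each undirected pair once
    test_upper = rng.binomial(upper, eps)         # binomial thinning of counts
    train_upper = upper - test_upper
    return train_upper + train_upper.T, test_upper + test_upper.T


def cross_validated_eigenvalues(A, k_max, eps=0.1, seed=0):
    """Return (lambda_test, z, p_value) for the k_max leading eigenvectors
    of the training graph, each evaluated on the held-out test graph."""
    rng = np.random.default_rng(seed)
    A_train, A_test = split_edges(A, eps, rng)

    # Eigenvectors of the symmetric training adjacency matrix,
    # ordered by largest algebraic eigenvalue (an assumption).
    vals, vecs = np.linalg.eigh(A_train.astype(float))
    order = np.argsort(-vals)[:k_max]

    results = []
    for idx in order:
        x = vecs[:, idx]
        lam_test = x @ A_test @ x                 # cross-validated eigenvalue
        # Assumed plug-in variance of x' A_test x given x.
        var_hat = 2.0 * ((x ** 2) @ A_test @ (x ** 2))
        z = lam_test / math.sqrt(var_hat) if var_hat > 0 else 0.0
        p_value = 0.5 * math.erfc(z / math.sqrt(2.0))  # upper-tail normal p-value
        results.append((float(lam_test), float(z), float(p_value)))
    return results


def estimate_k(results, alpha=0.05):
    """Assumed stopping rule: count leading dimensions until the first
    p-value that is not below alpha."""
    k = 0
    for _, _, p_value in results:
        if p_value >= alpha:
            break
        k += 1
    return k
```

Because the eigenvectors are computed from the training edges only, the test-graph quadratic form is not inflated by overfitting to noise; this is the intuition behind holding out edges. Calling `estimate_k(cross_validated_eigenvalues(A, k_max=10))` on a symmetric adjacency matrix `A` with nonnegative integer entries gives a rough estimate of $k$ under these assumptions.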