Deep networks are heavily over-parameterized, yet their learned representations often admit low-rank structure. We introduce a framework for estimating a model's intrinsic dimensionality by treating learned representations as projections onto a low-rank subspace of the model's full capacity. Our approach is to train a full-rank teacher, factorize its weights at multiple ranks, and train each factorized student via distillation, measuring performance as a function of rank. We define the effective rank as a region rather than a point: the smallest contiguous set of ranks for which the student reaches 85-95% of teacher accuracy. To stabilize estimates, we fit accuracy versus rank with a monotone PCHIP interpolant and identify where the normalized curve crosses this band. We also define the effective knee as the rank maximizing the perpendicular distance between the smoothed accuracy curve and its endpoint secant, an intrinsic indicator of where marginal gains concentrate. On ViT-B/32 fine-tuned on CIFAR-100 (one seed, due to compute constraints), factorizing the linear blocks and training with distillation yields an effective-rank region of approximately [16, 34] and an effective knee at r* ≈ 31. At rank 32, the student attains 69.46% top-1 accuracy vs. 73.35% for the teacher (about 94.7% of baseline) while achieving substantial parameter compression. The result is a practical framework for estimating effective-rank regions and knees across architectures and datasets, characterizing the intrinsic dimensionality of deep models.
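To make the factorization step concrete, here is a minimal PyTorch sketch, not the paper's implementation: it replaces a full-rank `nn.Linear` with two low-rank factors initialized from a truncated SVD of the teacher's weight, after which the student would be trained via distillation. The class name `FactorizedLinear` and the SVD-based initialization are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class FactorizedLinear(nn.Module):
    """Rank-r factorization of a linear layer: W (out x in) ~ B @ A,
    with B (out x r) and A (r x in). Hypothetical helper; the paper's
    exact factorization and initialization scheme may differ."""
    def __init__(self, linear: nn.Linear, rank: int):
        super().__init__()
        out_f, in_f = linear.weight.shape
        # Truncated SVD of the teacher's weight is one plausible way to
        # initialize the student's factors before distillation.
        U, S, Vh = torch.linalg.svd(linear.weight.data, full_matrices=False)
        r = min(rank, S.numel())
        self.A = nn.Linear(in_f, r, bias=False)
        self.B = nn.Linear(r, out_f, bias=linear.bias is not None)
        # Split the singular values symmetrically between the two factors.
        self.A.weight.data = torch.diag(S[:r].sqrt()) @ Vh[:r]          # (r, in)
        self.B.weight.data = U[:, :r] @ torch.diag(S[:r].sqrt())        # (out, r)
        if linear.bias is not None:
            self.B.bias.data = linear.bias.data.clone()

    def forward(self, x):
        return self.B(self.A(x))
```

At rank r, the layer stores r * (in + out) weights instead of in * out, which is where the parameter compression at small ranks comes from.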
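The region and knee estimates can then be read off the smoothed accuracy-vs.-rank curve. Below is a hedged NumPy/SciPy sketch of that post-processing, assuming `scipy.interpolate.PchipInterpolator` for the shape-preserving fit; the helper name `effective_region_and_knee`, the dense-grid evaluation, and the return convention are illustrative choices, not the paper's code.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

def effective_region_and_knee(ranks, accs, teacher_acc, lo=0.85, hi=0.95):
    """Fit a monotone PCHIP interpolant to accuracy vs. rank, find where the
    normalized curve lies in the [lo, hi] band, and locate the knee as the
    point of maximum perpendicular distance from the endpoint secant."""
    ranks = np.asarray(ranks, dtype=float)
    norm = np.asarray(accs, dtype=float) / teacher_acc
    f = PchipInterpolator(ranks, norm)
    grid = np.linspace(ranks.min(), ranks.max(), 2000)
    y = f(grid)

    # Effective-rank region: ranks whose normalized accuracy falls in
    # [lo, hi]. For a monotone curve this set is a single interval, so
    # taking min/max of the in-band grid points recovers its endpoints.
    in_band = (y >= lo) & (y <= hi)
    region = (grid[in_band].min(), grid[in_band].max()) if in_band.any() else None

    # Effective knee: maximize the perpendicular distance between the
    # smoothed curve and the secant joining its endpoints.
    p0 = np.array([grid[0], y[0]])
    p1 = np.array([grid[-1], y[-1]])
    d = (p1 - p0) / np.linalg.norm(p1 - p0)   # unit direction of the secant
    pts = np.stack([grid, y], axis=1) - p0
    dist = np.abs(pts[:, 0] * d[1] - pts[:, 1] * d[0])  # 2-D cross product
    knee = grid[np.argmax(dist)]
    return region, knee
```

One design note: because PCHIP is shape-preserving, the fit cannot introduce spurious oscillations between measured ranks, which keeps both the band crossings and the knee location stable under noisy per-rank accuracies.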