Recent advances in visual generation have highlighted the rise of Latent Generative Models (LGMs), which rely on effective visual tokenizers to bridge pixels and semantics. However, existing tokenizers are typically trained from scratch and struggle to balance semantic representation and reconstruction fidelity, particularly in high-dimensional latent spaces. In this work, we introduce DINO-Tok, a DINO-based visual tokenizer that unifies hierarchical representations into an information-complete latent space. By integrating shallow features that retain fine-grained details with deep features encoding global semantics, DINO-Tok effectively bridges pretrained representations and visual generation. We further analyze the challenges of vector quantization (VQ) in this high-dimensional space, where key information is often lost and codebook collapse occurs. We therefore propose a global PCA reweighting mechanism that stabilizes VQ and preserves essential information across dimensions. On ImageNet 256$\times$256, DINO-Tok achieves state-of-the-art reconstruction performance, reaching 28.54 PSNR for autoencoding and 23.98 PSNR for VQ-based modeling, significantly outperforming prior tokenizers and performing comparably to models trained on billion-scale data (such as Hunyuan and Wan). These results demonstrate that adapting powerful pretrained vision models such as DINO for tokenization yields semantically aligned, high-fidelity latent representations, paving the way for next-generation visual generative models. Code will be publicly available at https://github.com/MKJia/DINO-Tok.
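The abstract does not spell out how the global PCA reweighting interacts with vector quantization, so the following is a minimal illustrative sketch of one plausible reading: compute a PCA basis over the latent dimensions globally, then scale each principal direction by its explained-variance share before the nearest-neighbor codebook lookup, so that high-variance (information-carrying) dimensions dominate the distance. All function names (`pca_reweight`, `quantize`) are hypothetical and not from the paper.

```python
import numpy as np

def pca_reweight(latents):
    """Hypothetical global PCA step.

    latents: (N, D) array of flattened latent vectors collected globally.
    Returns the PCA basis and per-direction weights (explained-variance shares).
    """
    centered = latents - latents.mean(axis=0)
    cov = centered.T @ centered / len(latents)          # (D, D) covariance
    eigvals, eigvecs = np.linalg.eigh(cov)              # ascending eigenvalues
    weights = eigvals / eigvals.sum()                   # normalized variance shares
    return eigvecs, weights

def quantize(z, codebook, eigvecs, weights):
    """Nearest-neighbor VQ in the reweighted PCA space (illustrative only).

    z: (B, D) latents; codebook: (K, D) code vectors.
    """
    zp = (z @ eigvecs) * weights                        # project and reweight latents
    cp = (codebook @ eigvecs) * weights                 # same transform for codes
    dists = ((zp[:, None, :] - cp[None, :, :]) ** 2).sum(-1)  # (B, K) squared distances
    idx = dists.argmin(axis=1)                          # nearest code per latent
    return codebook[idx], idx
```

Under this reading, reweighting shrinks low-variance directions so the codebook is not fragmented by noise dimensions, which is one way codebook collapse in high-dimensional spaces could be mitigated.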