Vector quantization (VQ) transforms continuous image features into discrete representations, providing compressed, tokenized inputs for generative models. However, VQ-based frameworks suffer from several issues, such as non-smooth latent spaces, weak alignment between representations before and after quantization, and poor coherence between the continuous and discrete domains. These issues lead to unstable codeword learning and underutilized codebooks, ultimately degrading performance on both reconstruction and downstream generation tasks. To address these issues, we propose VAEVQ, which comprises three key components: (1) Variational Latent Quantization (VLQ), which replaces the autoencoder (AE) with a variational autoencoder (VAE) for quantization, leveraging its structured and smooth latent space to facilitate more effective codeword activation; (2) a Representation Coherence Strategy (RCS), which adaptively modulates the alignment strength between pre- and post-quantization features to enhance consistency while preventing overfitting to noise; and (3) Distribution Consistency Regularization (DCR), which aligns the codebook distribution with the continuous latent distribution to improve codebook utilization. Extensive experiments on two benchmark datasets demonstrate that VAEVQ outperforms state-of-the-art methods.
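To make the three components concrete, the sketch below illustrates one plausible way such a quantizer could be wired together: a variational latent feeding nearest-codeword lookup (VLQ-style), a fixed-weight pre/post-quantization alignment term standing in for RCS, and a simple moment-matching penalty standing in for DCR. This is a minimal, hypothetical sketch, not the paper's implementation; all class names, loss weights, and the specific regularizers are assumptions, and the adaptive modulation of RCS is deliberately simplified to a constant weight.

```python
# Minimal sketch (not the paper's code): a VAE-based vector quantizer with
# (i) a variational latent (VLQ-style), (ii) a pre/post-quantization alignment
# term (RCS-style, fixed weight instead of adaptive), and (iii) a crude
# codebook-vs-latent distribution matcher (DCR-style). Names and weights are
# illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VAEVectorQuantizerSketch(nn.Module):
    def __init__(self, dim=64, num_codes=512):
        super().__init__()
        self.to_mu = nn.Linear(dim, dim)       # encoder head: mean
        self.to_logvar = nn.Linear(dim, dim)   # encoder head: log-variance
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, h):
        # h: (B, N, dim) continuous encoder features.
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())

        # Nearest-codeword lookup on the variational latent.
        cb = self.codebook.weight                      # (num_codes, dim)
        dist = torch.cdist(z, cb.unsqueeze(0).expand(z.size(0), -1, -1))
        idx = dist.argmin(dim=-1)                      # (B, N)
        z_q = self.codebook(idx)                       # (B, N, dim)

        # Straight-through estimator so gradients flow back to the encoder.
        z_q_st = z + (z_q - z).detach()

        # Pre/post-quantization coherence term (constant weights here; the
        # paper describes adaptive modulation, which this sketch omits).
        align = F.mse_loss(z_q, z.detach()) + 0.25 * F.mse_loss(z, z_q.detach())

        # Crude distribution matching: pull codebook moments toward the
        # moments of the continuous latent batch.
        flat_z = z.detach().reshape(-1, z.size(-1))
        dist_reg = (F.mse_loss(cb.mean(0), flat_z.mean(0))
                    + F.mse_loss(cb.std(0), flat_z.std(0)))

        return z_q_st, idx, kl + align + dist_reg


# Usage: quantize a batch of flattened feature maps (8 images, 256 tokens each).
quantizer = VAEVectorQuantizerSketch(dim=64, num_codes=512)
feats = torch.randn(8, 256, 64)
z_q, codes, aux_loss = quantizer(feats)
```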