We introduce DiverseVAR, a framework that enhances the diversity of text-conditioned visual autoregressive (VAR) models at test time, without retraining, fine-tuning, or substantial computational overhead. While VAR models have recently emerged as strong competitors to diffusion and flow models for image generation, they suffer from a critical limitation in diversity, often producing nearly identical images even for simple prompts. This issue has largely gone unnoticed amid the predominant focus on image quality. We address it at test time in two stages. First, inspired by diversity-enhancement techniques for diffusion models, we propose injecting noise into the text embedding. This introduces a trade-off between diversity and image quality: as diversity increases, image quality declines sharply. To preserve quality, we propose scale-travel, a novel latent refinement technique inspired by time-travel strategies in diffusion models: a multi-scale autoencoder extracts coarse-scale tokens that let us resume generation at intermediate scales. Extensive experiments show that combining text-embedding noise injection with our scale-travel refinement significantly enhances diversity while minimizing image-quality degradation, establishing a new Pareto frontier in the diversity-quality trade-off.
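To make the first stage concrete, below is a minimal sketch of text-embedding noise injection, assuming a PyTorch-style pipeline; the function name and the `sigma` parameter are illustrative, not the paper's API.

```python
import torch

def perturb_text_embedding(text_emb: torch.Tensor, sigma: float = 0.1) -> torch.Tensor:
    """Inject Gaussian noise into a text embedding to diversify generations.

    `sigma` controls the diversity/quality trade-off described above:
    larger values yield more diverse but lower-quality images.
    (Hypothetical helper; names are illustrative, not the paper's code.)
    """
    noise = torch.randn_like(text_emb)
    return text_emb + sigma * noise
```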
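The scale-travel stage can be sketched in the same spirit. The sketch below assumes hypothetical interfaces, `encode_multiscale` (the multi-scale autoencoder's per-scale tokenizer) and `generate_from_scale` (resuming VAR next-scale prediction from intermediate tokens); neither is an existing library API, and the actual method may differ in detail.

```python
import torch

def scale_travel(var_model, ms_autoencoder, image: torch.Tensor,
                 resume_scale: int, cond_emb: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch of scale-travel latent refinement.

    A draft `image` (e.g. generated from a noised text embedding) is
    re-encoded by a multi-scale autoencoder into per-scale token maps.
    Only the coarse scales up to `resume_scale` are kept; the VAR model
    then regenerates the remaining fine scales conditioned on them,
    recovering detail quality while preserving the diverse coarse layout.
    """
    with torch.no_grad():
        # Multi-scale tokenization: one token map per resolution scale.
        scale_tokens = ms_autoencoder.encode_multiscale(image)
        # Keep the coarse scales; discard the quality-degraded fine scales.
        coarse = scale_tokens[: resume_scale + 1]
        # Resume next-scale autoregressive generation from this stage.
        return var_model.generate_from_scale(coarse, cond=cond_emb)
```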