Pixel-space generative models are often more difficult to train and generally underperform compared to their latent-space counterparts, leaving a persistent performance and efficiency gap. In this paper, we introduce a novel two-stage training framework that closes this gap for pixel-space diffusion and consistency models. In the first stage, we pre-train encoders to capture meaningful semantics from clean images while aligning them with points along the same deterministic sampling trajectory, which evolves points from the prior to the data distribution. In the second stage, we integrate the encoder with a randomly initialized decoder and fine-tune the complete model end-to-end for both diffusion and consistency models. Our framework achieves state-of-the-art (SOTA) performance on ImageNet. Specifically, our diffusion model reaches an FID of 1.58 on ImageNet-256 and 2.35 on ImageNet-512 using 75 function evaluations (NFE), surpassing prior pixel-space methods and VAE-based counterparts by a large margin in both generation quality and training efficiency. In a direct comparison, our model significantly outperforms DiT while using only around 30\% of its training compute.
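To make the two-stage framework concrete, the following is a minimal sketch of one plausible instantiation in our own notation; the exact objectives, feature distances, and regression targets used in the paper may differ. Let $E_\phi$ denote the encoder, $D_\theta$ the randomly initialized decoder, $x_0$ a clean image, and $x_t$ a point on the deterministic sampling trajectory running from the prior to $x_0$. Stage 1 pre-trains the encoder with an alignment objective of the assumed form
\[
\mathcal{L}_{\text{align}}(\phi) \;=\; \mathbb{E}_{x_0,\,t}\!\left[ d\!\left( E_\phi(x_0),\, E_\phi(x_t) \right) \right],
\]
where $d$ is a feature-space distance (e.g., negative cosine similarity). Stage 2 then fine-tunes the encoder and decoder end-to-end with a standard diffusion (or consistency) regression loss,
\[
\mathcal{L}_{\text{gen}}(\phi,\theta) \;=\; \mathbb{E}_{x_0,\,t}\!\left[ \left\| D_\theta\!\left(x_t,\, t,\, E_\phi(x_t)\right) - y(x_0, x_t, t) \right\|_2^2 \right],
\]
where $y(x_0, x_t, t)$ stands for the usual denoising or consistency target.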