Visual AutoRegressive modeling (VAR) based on next-scale prediction has revitalized autoregressive visual generation. Although its full-context dependency, i.e., modeling all previous scales for next-scale prediction, facilitates more stable and comprehensive representation learning by leveraging the complete information flow, the resulting computational inefficiency and substantial overhead severely hinder VAR's practicality and scalability. This motivates us to develop a new VAR model that achieves better performance and efficiency without full-context dependency. To this end, we reformulate VAR as a non-full-context Markov process and propose Markov-VAR. It is realized through Markovian Scale Prediction: we treat each scale as a Markov state and introduce a sliding window that compresses a fixed number of previous scales into a compact history vector, compensating for the historical information lost by abandoning full-context dependency. Integrating the history vector with the Markov state yields a representative dynamic state that evolves under a Markov process. Extensive experiments demonstrate that Markov-VAR is extremely simple yet highly effective: compared to VAR on ImageNet, Markov-VAR reduces FID by 10.5% (256 $\times$ 256) and decreases peak memory consumption by 83.8% (1024 $\times$ 1024). We believe that Markov-VAR can serve as a foundation for future research on visual autoregressive generation and other downstream tasks.
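To make the mechanism concrete, below is a minimal, illustrative sketch of Markovian Scale Prediction as described above: a sliding window of previous scales is compressed into a compact history vector, which is fused with the current Markov state to form the dynamic state used for next-scale prediction. All names (`MarkovScaleState`, `compress`, `fuse`), the window size, and the use of pooled per-scale features of shape (B, dim) are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of Markovian Scale Prediction (simplified: each scale is
# represented by a single pooled feature vector rather than a full token map).
class MarkovScaleState(nn.Module):
    def __init__(self, dim: int, window: int = 2):
        super().__init__()
        self.window = window                          # number of previous scales kept in the sliding window
        self.compress = nn.Linear(window * dim, dim)  # compresses windowed scales into a compact history vector
        self.fuse = nn.Linear(2 * dim, dim)           # fuses history vector with the current Markov state

    def forward(self, scale_feats: list) -> torch.Tensor:
        """scale_feats: per-scale features, each of shape (B, dim),
        ordered from the coarsest scale to the most recent one."""
        current = scale_feats[-1]                         # current Markov state (latest scale)
        prev = scale_feats[-1 - self.window:-1]           # sliding window of earlier scales
        if len(prev) < self.window:                       # pad at the coarsest scales
            prev = [torch.zeros_like(current)] * (self.window - len(prev)) + list(prev)
        history = self.compress(torch.cat(prev, dim=-1))  # compact history vector
        # dynamic state passed to the next-scale prediction step
        return self.fuse(torch.cat([history, current], dim=-1))
```

In this sketch, memory grows only with the window size rather than with the full scale history, which is the intuition behind the reported reduction in peak memory at high resolutions.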