Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length

Existing diffusion-based video generation methods are fundamentally constrained by sequential computation and long-horizon inconsistency, limiting their practical adoption in real-time, streaming audio-driven avatar synthesis. We present Live Avatar, an algorithm-system co-designed framework that enables efficient, high-fidelity, and infinite-length avatar generation using a 14-billion-parameter diffusion model. Our approach introduces Timestep-forcing Pipeline Parallelism (TPP), a distributed inference paradigm that pipelines denoising steps across multiple GPUs, effectively breaking the autoregressive bottleneck and ensuring stable, low-latency real-time streaming. To further enhance temporal consistency and mitigate identity drift and color artifacts, we propose the Rolling Sink Frame Mechanism (RSFM), which maintains sequence fidelity by dynamically recalibrating appearance using a cached reference image. Additionally, we leverage Self-Forcing Distribution Matching Distillation to facilitate causal, streamable adaptation of large-scale models without sacrificing visual quality. Live Avatar demonstrates state-of-the-art performance, reaching 20 FPS end-to-end generation on 5 H800 GPUs, and, to the best of our knowledge, is the first to achieve practical, real-time, high-fidelity avatar generation at this scale. Our work establishes a new paradigm for deploying advanced diffusion models in industrial long-form video synthesis applications.
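
The effect of pipelining denoising steps across devices can be illustrated with a toy timing model. The sketch below is not the authors' implementation. It assumes that each device is pinned to exactly one denoising step, that every step takes the same time, that communication cost is zero, and that a stage may begin the next block as soon as it has finished the previous one, which is our reading of how TPP removes the block-to-block wait. The function names and the choice of 4 steps are hypothetical and chosen only for illustration.

```python
def sequential_finish_times(num_blocks, num_steps, stage_time):
    """Baseline: one device runs every denoising step of a block
    before it starts the next block."""
    return [(b + 1) * num_steps * stage_time for b in range(num_blocks)]


def pipelined_finish_times(num_blocks, num_steps, stage_time):
    """Pipelined schedule (assumption: one device per denoising step).

    Stage s can start block b once stage s-1 has finished block b
    and stage s itself has finished block b-1.
    """
    finish = [[0.0] * num_steps for _ in range(num_blocks)]
    for b in range(num_blocks):
        for s in range(num_steps):
            prev_stage = finish[b][s - 1] if s > 0 else 0.0
            prev_block = finish[b - 1][s] if b > 0 else 0.0
            finish[b][s] = max(prev_stage, prev_block) + stage_time
    # A block is complete when the last stage has processed it.
    return [finish[b][-1] for b in range(num_blocks)]


if __name__ == "__main__":
    # Illustrative values only: 3 video blocks, 4 denoising steps, 1 time unit per step.
    print(sequential_finish_times(3, 4, 1.0))
    print(pipelined_finish_times(3, 4, 1.0))
```

With these illustrative values, the sequential schedule completes blocks at times 4, 8, and 12, while the pipelined schedule completes them at 4, 5, and 6. The latency of the first block is unchanged, but in steady state one block is emitted per step time rather than per full denoising pass, which is the property that matters for streaming.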