Visual generation spans both image and video generation: training probabilistic models to create coherent, diverse, and semantically faithful content from scratch. While early research focused on unconditional sampling, practitioners now demand controllable generation that allows precise specification of layout, pose, motion, or style. ControlNet provides such precise spatio-temporal control, but its auxiliary branch markedly increases latency and introduces redundant computation in both uncontrolled regions and denoising steps, especially for video. To address this problem, we introduce EVCtrl, a lightweight, plug-and-play control adapter that substantially reduces this overhead without retraining the model. Specifically, we propose a spatio-temporal dual caching strategy for sparse control information. For spatial redundancy, we first profile how each layer of DiT-ControlNet responds to fine-grained control and then partition the network into global and local functional zones; a locality-aware cache focuses computation on the local zones that truly need the control signal, skipping the bulk of redundant computation in global regions. For temporal redundancy, we selectively omit unnecessary denoising steps to improve efficiency. Extensive experiments on CogVideo-ControlNet, Wan2.1-ControlNet, and Flux demonstrate that our method is effective for controllable image and video generation without any training, achieving, for example, 2.16× and 2.05× speedups on CogVideo-ControlNet and Wan2.1-ControlNet, respectively, with almost no degradation in generation quality. Code is available in the supplementary materials.
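To make the spatio-temporal dual caching idea concrete, the toy sketch below illustrates it on a stand-in denoising loop. This is not the authors' implementation: the module names (`ToyControlBranch`), the refresh-step schedule, and the choice of "local" layer indices are hypothetical placeholders; only the two mechanisms — reusing cached control residuals on skipped denoising steps, and injecting control only into locality-sensitive layers — follow the description above.

```python
# Minimal sketch of spatio-temporal dual caching for a ControlNet-style branch.
# All names and hyperparameters here are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn


class ToyControlBranch(nn.Module):
    """Stand-in for the auxiliary control branch: one residual per backbone layer."""

    def __init__(self, num_layers: int, dim: int):
        super().__init__()
        self.projs = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_layers))

    def forward(self, control_tokens: torch.Tensor) -> list[torch.Tensor]:
        return [proj(control_tokens) for proj in self.projs]


def denoise_with_dual_cache(
    latents: torch.Tensor,           # (tokens, dim) toy latent tokens
    control_tokens: torch.Tensor,    # (tokens, dim) encoded control signal
    dit_layers: nn.ModuleList,       # toy backbone layers (Linear blocks here)
    control_branch: ToyControlBranch,
    num_steps: int = 10,
    refresh_steps: frozenset = frozenset({0, 3, 6, 9}),  # temporal cache: recompute only here
    local_layers: frozenset = frozenset({2, 3}),         # spatial cache: layers that need control
) -> torch.Tensor:
    cached_residuals = None  # control-branch outputs reused on skipped steps
    x = latents
    for step in range(num_steps):
        # Temporal caching: run the control branch only on refresh steps,
        # otherwise reuse the residuals cached at the last refresh step.
        if cached_residuals is None or step in refresh_steps:
            cached_residuals = control_branch(control_tokens)

        for i, layer in enumerate(dit_layers):
            x = layer(x)
            # Locality-aware (spatial) caching: inject control residuals only into
            # the "local" layers that respond to fine-grained control; the global
            # layers skip the control computation entirely.
            if i in local_layers:
                x = x + cached_residuals[i]
    return x


if __name__ == "__main__":
    dim, num_layers = 16, 4
    layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_layers))
    branch = ToyControlBranch(num_layers, dim)
    out = denoise_with_dual_cache(torch.randn(8, dim), torch.randn(8, dim), layers, branch)
    print(out.shape)  # torch.Size([8, 16])
```

In a real DiT-ControlNet the cached residuals would also depend on the evolving noisy latents and the partition into global/local zones would come from profiling each layer's response to the control signal, as described in the abstract; the sketch only fixes these choices by hand to keep the example self-contained.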