We propose \textbf{IC-Effect}, an instruction-guided, DiT-based framework for few-shot video VFX editing that synthesizes complex effects (\eg, flames, particles, and cartoon characters) while strictly preserving spatial and temporal consistency. Video VFX editing is highly challenging: injected effects must blend seamlessly with the scene, the background must remain entirely unchanged, and effect patterns must be learned efficiently from limited paired data. Existing video editing models fail to satisfy these requirements. IC-Effect treats the source video as a clean contextual condition, exploiting the in-context learning capability of DiT models to achieve precise background preservation and natural effect injection. A two-stage training strategy, consisting of general editing adaptation followed by effect-specific learning via Effect-LoRA, ensures strong instruction following and robust effect modeling. To further improve efficiency, we introduce spatiotemporal sparse tokenization, which retains high fidelity at substantially reduced computation. We also release a paired VFX editing dataset spanning $15$ high-quality visual styles. Extensive experiments show that IC-Effect delivers high-quality, controllable, and temporally consistent VFX editing, opening new possibilities for video creation.
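To make the in-context conditioning and sparse-tokenization ideas concrete, the following is a minimal sketch, assuming a generic DiT-style transformer backbone. The module name \texttt{InContextDiT}, the tensor shapes, the encoder stack, and the sparsification stride are all illustrative assumptions for exposition, not IC-Effect's actual implementation: clean source-video tokens are subsampled, concatenated with noisy target tokens along the sequence axis, and only the target positions receive the denoising prediction.

\begin{verbatim}
# Hedged sketch: in-context conditioning with sparse source tokens.
# All module/parameter names here are hypothetical.
import torch
import torch.nn as nn

class InContextDiT(nn.Module):
    def __init__(self, dim=1024, depth=8, heads=16):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.out = nn.Linear(dim, dim)

    def forward(self, src_tokens, noisy_tgt_tokens, stride=4):
        # Spatiotemporal sparse tokenization (assumed form): keep every
        # `stride`-th clean source token so the context stays cheap.
        sparse_src = src_tokens[:, ::stride]
        # Concatenate clean context and noisy target along the sequence
        # axis; full attention lets target tokens read the source video.
        x = torch.cat([sparse_src, noisy_tgt_tokens], dim=1)
        x = self.blocks(x)
        # Only the target positions carry the denoising prediction/loss.
        return self.out(x[:, sparse_src.shape[1]:])

# Usage: batch of 2 clips, 256 source tokens, 256 target tokens.
model = InContextDiT()
src = torch.randn(2, 256, 1024)
noisy_tgt = torch.randn(2, 256, 1024)
pred = model(src, noisy_tgt)  # -> (2, 256, 1024)
\end{verbatim}

Under this reading, background preservation comes for free at the architectural level: the clean source tokens are never noised or regenerated, so the model only has to learn the residual edit, which is what makes few-shot effect learning via a lightweight LoRA plausible.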