Recent advances in subject-driven video generation with large diffusion models have enabled personalized content synthesis conditioned on user-provided subjects. However, existing methods lack fine-grained temporal control over subject appearance and disappearance, which is essential for applications such as compositional video synthesis, storyboarding, and controllable animation. We propose AlcheMinT, a unified framework that introduces explicit timestamp conditioning for subject-driven video generation. Our approach employs a novel positional encoding mechanism that enables the encoding of temporal intervals, associated in our case with subject identities, while integrating seamlessly with the positional embeddings of the pretrained video generation model. Additionally, we incorporate subject-descriptive text tokens that strengthen the binding between visual identities and video captions, mitigating ambiguity during generation. Through token-wise concatenation, AlcheMinT avoids any additional cross-attention modules and incurs negligible parameter overhead. We establish a benchmark evaluating multi-subject identity preservation, video fidelity, and temporal adherence. Experimental results demonstrate that AlcheMinT matches the visual quality of state-of-the-art video personalization methods while, for the first time, enabling precise temporal control over multi-subject generation within videos. The project page is at https://snap-research.github.io/Video-AlcheMinT
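The abstract does not spell out how interval conditioning and token-wise concatenation are realized, so the PyTorch sketch below is only one plausible reading of it, not the released implementation: each subject's reference tokens are tagged with a sinusoidal embedding of their appearance interval and then appended to the video token sequence. All names here (`sinusoidal_embedding`, `IntervalConditionedSubjectTokens`, the endpoints `t_start`/`t_end`) are our own illustrative assumptions.

```python
import torch
import torch.nn as nn


def sinusoidal_embedding(positions: torch.Tensor, dim: int) -> torch.Tensor:
    """Standard sinusoidal embedding of scalar (frame-index) positions."""
    half = dim // 2
    freqs = torch.exp(-torch.arange(half) * (torch.log(torch.tensor(10000.0)) / half))
    angles = positions[:, None] * freqs                       # (B, half)
    return torch.cat([angles.sin(), angles.cos()], dim=-1)    # (B, dim)


class IntervalConditionedSubjectTokens(nn.Module):
    """Tag one subject's reference tokens with its appearance interval
    [t_start, t_end] so the video backbone can bind identity to frames.
    A hypothetical module sketching the idea described in the abstract."""

    def __init__(self, dim: int):
        super().__init__()
        # A single linear projection keeps the parameter overhead
        # negligible, in the spirit of the abstract's claim.
        self.proj = nn.Linear(dim, dim)

    def forward(self, subject_tokens: torch.Tensor, t_start: int, t_end: int) -> torch.Tensor:
        # subject_tokens: (B, N, D) visual-identity tokens for one subject.
        B, _, D = subject_tokens.shape
        # Assumption: the interval endpoints reuse the same sinusoidal
        # scheme the backbone uses for frame positions.
        start = sinusoidal_embedding(torch.full((B,), float(t_start)), D)
        end = sinusoidal_embedding(torch.full((B,), float(t_end)), D)
        interval = self.proj(start + end)                     # (B, D)
        return subject_tokens + interval[:, None, :]          # broadcast over N tokens


# Token-wise concatenation: conditioned subject tokens are simply appended
# to the video latent sequence, so the pretrained self-attention layers
# attend to them directly and no extra cross-attention module is needed.
# tokens = torch.cat([video_tokens, conditioned_subject_tokens], dim=1)
```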