Storytelling video generation (SVG) aims to produce coherent and visually rich multi-scene videos that follow a structured narrative. Existing methods primarily employ LLMs for high-level planning, decomposing a story into scene-level descriptions that are then generated independently and stitched together. However, these approaches struggle to generate high-quality videos aligned with complex single-scene descriptions, since visualizing such descriptions requires coherently composing multiple characters and events, synthesizing complex motions, and customizing multiple characters. To address these challenges, we propose DREAMRUNNER, a novel story-to-video generation method: First, we structure the input script with a large language model (LLM) to enable both coarse-grained scene planning and fine-grained object-level layout planning. Next, DREAMRUNNER introduces retrieval-augmented test-time adaptation to capture target motion priors for the objects in each scene, supporting diverse motion customization based on retrieved videos and thus facilitating the generation of new videos with complex, scripted motions. Lastly, we propose SR3AI, a novel spatial-temporal region-based 3D attention and prior-injection module for fine-grained object-motion binding and frame-by-frame spatial-temporal semantic control. We compare DREAMRUNNER with a range of SVG baselines, demonstrating state-of-the-art performance in character consistency, text alignment, and smooth transitions. Additionally, DREAMRUNNER exhibits strong fine-grained condition-following ability in compositional text-to-video generation, significantly outperforming baselines on T2V-CompBench. Finally, we validate DREAMRUNNER's robust ability to generate multi-object interactions with qualitative examples.
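To make the three-stage pipeline concrete, the sketch below outlines its control flow in Python. It is a minimal illustration under stated assumptions, not the authors' implementation: every name (plan_with_llm, adapt_motion_priors, generate_scene, ScenePlan, ObjectLayout) is hypothetical, and the stub bodies stand in for the actual LLM planner, the retrieval-plus-test-time-adaptation step, and the SR3AI-conditioned video diffusion model.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical, simplified data structures for the two-level plan.
@dataclass
class ObjectLayout:
    """Per-object layout: a motion phrase to bind and a box per frame."""
    name: str
    motion: str
    boxes: List[Tuple[float, float, float, float]]  # (x0, y0, x1, y1)

@dataclass
class ScenePlan:
    description: str
    layouts: List[ObjectLayout]

def plan_with_llm(script: str) -> List[ScenePlan]:
    """Stage 1 (stub): an LLM structures the script into coarse
    scene-level descriptions and fine object-level layouts."""
    # A real system would prompt an LLM here; we return a toy plan.
    return [
        ScenePlan(
            description="A knight rides a meadow; a dragon circles above.",
            layouts=[
                ObjectLayout("knight", "riding", [(0.1, 0.5, 0.4, 0.9)] * 16),
                ObjectLayout("dragon", "circling", [(0.5, 0.1, 0.9, 0.4)] * 16),
            ],
        )
    ]

def adapt_motion_priors(plans: List[ScenePlan]) -> Dict[str, str]:
    """Stage 2 (stub): retrieve reference videos for each scripted motion
    and run test-time adaptation to obtain one motion prior per phrase."""
    motions = {layout.motion for plan in plans for layout in plan.layouts}
    # Stand-in: map each motion phrase to an adapted-prior identifier.
    return {m: f"prior::{m}" for m in motions}

def generate_scene(plan: ScenePlan, priors: Dict[str, str]) -> str:
    """Stage 3 (stub): a region-conditioned diffusion model would consume
    the layouts and injected priors (the role SR3AI plays), binding each
    object to its motion region by region, frame by frame."""
    bound = ", ".join(f"{l.name}<-{priors[l.motion]}" for l in plan.layouts)
    return f"video[{plan.description} | {bound}]"

def dreamrunner(script: str) -> List[str]:
    """End-to-end: plan -> adapt motion priors -> generate each scene."""
    plans = plan_with_llm(script)
    priors = adapt_motion_priors(plans)
    return [generate_scene(plan, priors) for plan in plans]

if __name__ == "__main__":
    for clip in dreamrunner("A knight's journey told in one scene."):
        print(clip)
```

The key design point the sketch preserves is the separation of concerns: planning and motion-prior adaptation happen before generation, so the final diffusion step only consumes structured, per-object conditions rather than the raw story text.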