Sequential identity consistency under precise transient-attribute control remains a long-standing challenge in controllable visual storytelling. Existing datasets lack sufficient fidelity and fail to disentangle stable identities from transient attributes, which limits structured control over pose, expression, and scene composition and thus constrains reliable sequential synthesis. To address this gap, we introduce \textbf{2K-Characters-10K-Stories}, a multi-modal stylized narrative dataset of \textbf{2{,}000} uniquely stylized characters appearing across \textbf{10{,}000} illustrated stories. It is the first dataset to pair unique identities at scale with explicit, decoupled control signals for sequential identity consistency. We propose a \textbf{Human-in-the-Loop (HiL) pipeline} that leverages expert-verified character templates and LLM-guided narrative planning to generate highly aligned structured data. A \textbf{decoupled control} scheme separates persistent identity from transient attributes (pose and expression), while a \textbf{Quality-Gated loop} integrating MLLM evaluation, Auto-Prompt Tuning, and Local Image Editing enforces pixel-level consistency. Extensive experiments demonstrate that models fine-tuned on our dataset achieve performance comparable to closed-source models in generating visual narratives.
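As a minimal illustration of the decoupled control scheme, the sketch below shows how one training record might pair a persistent identity with per-panel transient attributes; all field names are hypothetical assumptions for exposition, not the dataset's actual schema.

\begin{verbatim}
from dataclasses import dataclass, field
from typing import List

@dataclass
class Identity:
    # Persistent signal: fixed for a character across every story
    # (hypothetical fields, for illustration only).
    character_id: str
    appearance_template: str   # expert-verified character template

@dataclass
class PanelControl:
    # Transient signals: free to change panel by panel.
    pose: str                  # e.g. "sitting", "running"
    expression: str            # e.g. "surprised", "calm"
    scene: str                 # scene composition for this panel

@dataclass
class StoryRecord:
    identity: Identity
    panels: List[PanelControl] = field(default_factory=list)
\end{verbatim}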