Step-by-step painting tutorials are vital for learning artistic techniques, but existing video resources (e.g., YouTube) lack interactivity and personalization. While recent generative models have advanced artistic image synthesis, they struggle to generalize across media and often exhibit temporal or structural inconsistencies, hindering faithful reproduction of human creative workflows. To address this, we propose a unified framework for multi-media painting process generation with a semantics-driven style control mechanism that embeds multiple media into a diffusion model's conditional space and applies cross-media style augmentation. This enables consistent texture evolution and process transfer across styles. A reverse-painting training strategy further ensures smooth, human-aligned generation. We also build a large-scale dataset of real painting processes and evaluate cross-media consistency, temporal coherence, and final-image fidelity, achieving strong results on LPIPS, DINO, and CLIP metrics. Finally, our Perceptual Distance Profile (PDP) curve quantitatively models the creative sequence, i.e., composition, color blocking, and detail refinement, mirroring human artistic progression.
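As a rough illustration of the PDP idea (not the authors' implementation), one plausible formulation is the LPIPS distance from each intermediate frame of the painting process to the finished image, traced over time; the `pdp_curve` function, the frame tensor layout, and the normalization below are all assumptions for this sketch.

```python
# Minimal sketch of a Perceptual Distance Profile (PDP), assuming frames are
# painting-process snapshots ordered from first stroke to finished image.
import torch
import lpips  # pip install lpips

def pdp_curve(frames: torch.Tensor) -> list[float]:
    """frames: (T, 3, H, W) tensor in [-1, 1]. Returns the LPIPS distance
    of every frame to the final frame; a decaying curve whose plateaus can
    mark phases such as composition, color blocking, and detail refinement."""
    metric = lpips.LPIPS(net='alex')  # AlexNet-based perceptual metric
    final = frames[-1].unsqueeze(0)   # finished painting as the reference
    with torch.no_grad():
        return [metric(f.unsqueeze(0), final).item() for f in frames]
```

Under this reading, comparing PDP curves of generated and real processes would quantify how closely the generated ordering mirrors human artistic progression.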