Multimedia documents such as slide presentations and posters are designed to be interactive and easy to modify. Yet, they are often distributed in a static raster format, which limits editing and customization. Restoring their editability requires converting these raster images back into structured vector formats. However, existing geometric raster-vectorization methods, which rely on low-level primitives like curves and polygons, fall short at this task. Specifically, when applied to complex documents like slides, they fail to preserve the high-level structure, resulting in a flat collection of shapes where the semantic distinction between image and text elements is lost. To overcome this limitation, we address the problem of semantic document derendering by introducing SliDer, a novel framework that uses Vision-Language Models (VLMs) to derender slide images as compact and editable Scalable Vector Graphics (SVG) representations. SliDer detects and extracts attributes from individual image and text elements in a raster input and organizes them into a coherent SVG format. Crucially, the model iteratively refines its predictions during inference in a process analogous to human design, generating SVG code that more faithfully reconstructs the original raster upon rendering. Furthermore, we introduce Slide2SVG, a novel dataset comprising raster-SVG pairs of slide documents curated from real-world scientific presentations, to facilitate future research in this domain. Our results demonstrate that SliDer achieves a reconstruction LPIPS of 0.069 and is favored by human evaluators in 82.9% of cases compared to the strongest zero-shot VLM baseline.
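To make the target representation concrete, the following is a minimal sketch of how detected image and text elements could be organized into a single editable SVG. It is not SliDer's actual implementation or output schema: the function name `build_slide_svg` and the element dictionary keys (`type`, `x`, `y`, `width`, `height`, `href`, `font_size`, `fill`, `content`) are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def build_slide_svg(width, height, elements):
    """Assemble a list of element dicts into an SVG string.

    The dict schema is an assumption made for this sketch, not the
    format used by SliDer.
    """
    svg = ET.Element("svg", {
        "xmlns": SVG_NS,
        "width": str(width),
        "height": str(height),
        "viewBox": f"0 0 {width} {height}",
    })
    for el in elements:
        if el["type"] == "image":
            # Image elements stay as separate, replaceable <image> nodes.
            ET.SubElement(svg, "image", {
                "x": str(el["x"]),
                "y": str(el["y"]),
                "width": str(el["width"]),
                "height": str(el["height"]),
                "href": el["href"],
            })
        elif el["type"] == "text":
            # Text stays as editable <text> rather than traced outlines.
            node = ET.SubElement(svg, "text", {
                "x": str(el["x"]),
                "y": str(el["y"]),
                "font-size": str(el["font_size"]),
                "fill": el["fill"],
            })
            node.text = el["content"]
        else:
            raise ValueError(f"unsupported element type: {el['type']!r}")
    return ET.tostring(svg, encoding="unicode")


# Hypothetical elements, as a detector might report them for one slide.
example_elements = [
    {"type": "text", "x": 40, "y": 80, "font_size": 32,
     "fill": "#000000", "content": "Title"},
    {"type": "image", "x": 40, "y": 120, "width": 400, "height": 300,
     "href": "figure.png"},
]
slide_svg = build_slide_svg(960, 540, example_elements)
```

Because each element remains its own node with explicit attributes, a title can be reworded or a figure swapped by editing one node, which is the kind of editability that a flat collection of curves and polygons does not offer.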