Automatic presentation slide generation can greatly streamline content creation. However, because preferences vary from user to user, existing under-specified task formulations often produce suboptimal results that fail to meet individual needs. We introduce a novel task that conditions paper-to-slides generation on user-specified preferences, and we propose SlideTailor, a human behavior-inspired agentic framework that progressively generates editable slides in a user-aligned manner. Instead of requiring users to describe their preferences in detailed text, our system asks only for a paper-slides example pair and a visual template: natural, easy-to-provide artifacts that implicitly encode rich user preferences across content and visual style. Despite the implicit and unlabeled nature of these inputs, our framework effectively distills and generalizes the preferences to guide customized slide generation. We also introduce a novel chain-of-speech mechanism that aligns slide content with the planned oral narration. This design significantly enhances the quality of the generated slides and enables downstream applications such as video presentations. To support this new task, we construct a benchmark dataset that captures diverse user preferences, together with carefully designed, interpretable metrics for robust evaluation. Extensive experiments demonstrate the effectiveness of our framework.