Generating humorous memes is a challenging multimodal task that moves beyond direct image-to-caption supervision: it requires nuanced reasoning over visual content, contextual cues, and subjective humor. To bridge this gap between visual perception and humorous punchline creation, we propose HUMOR, a novel framework that guides VLMs through hierarchical reasoning and aligns them with group-wise human preferences. First, HUMOR employs a hierarchical, multi-path Chain-of-Thought (CoT): the model begins by identifying a template-level intent, then explores diverse reasoning paths under different contexts, and finally anchors onto a high-quality, context-specific path. This CoT supervision, traced back from ground-truth captions, enhances reasoning diversity. We further show that this multi-path exploration with anchoring maintains a high expected humor quality, under the practical condition that high-quality paths retain significant probability mass. Second, to capture subjective humor, we train a pairwise reward model that operates within groups of memes sharing the same template. Following established theory, this approach yields a consistent and robust proxy for human preference, even with subjective and noisy labels. The reward model then enables group-wise reinforcement learning optimization, providing a theoretical guarantee of monotonic improvement within the trust region. Extensive experiments show that HUMOR equips various VLMs with superior reasoning diversity, more reliable preference alignment, and higher overall meme quality. Beyond memes, our work presents a general training paradigm for open-ended, human-aligned multimodal generation, where success is guided by comparative judgment within coherent output groups.
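The abstract describes a pairwise reward model trained within groups of memes sharing a template. A standard formulation for learning a scalar reward from pairwise preferences is the Bradley-Terry model; the sketch below is our illustrative assumption, not the paper's actual loss, and the function names (`bt_preference_prob`, `group_pairwise_loss`) and the ranked-group label format are hypothetical:

```python
import math

def bt_preference_prob(score_a: float, score_b: float) -> float:
    """Bradley-Terry probability that caption A is preferred over caption B,
    given scalar reward scores for each."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))

def pairwise_loss(score_a: float, score_b: float, a_preferred: bool = True) -> float:
    """Negative log-likelihood of one observed pairwise preference."""
    p = bt_preference_prob(score_a, score_b)
    return -math.log(p if a_preferred else 1.0 - p)

def group_pairwise_loss(scores: list[float], ranking: list[int]) -> float:
    """Sum the pairwise losses over all ordered pairs within one template group.

    `ranking` lists caption indices from most to least preferred (an assumed
    label format); every caption is compared only against others in its group,
    mirroring the group-wise setup described in the abstract.
    """
    total = 0.0
    for i, a in enumerate(ranking):
        for b in ranking[i + 1:]:
            total += pairwise_loss(scores[a], scores[b], a_preferred=True)
    return total
```

Comparing only within a template group keeps the comparisons coherent: two captions for the same template are judged on humor alone, not on template difficulty, which is one plausible reading of why group-wise comparison gives a more robust preference proxy.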