This paper presents a novel approach to unsupervised video summarization using reinforcement learning (RL), addressing limitations such as unstable adversarial training and reliance on heuristic reward functions. The method rests on the principle that reconstruction fidelity serves as a proxy for informativeness: a summary is good to the extent that the full video can be reconstructed from it. The summarizer model assigns importance scores to frames to generate the final summary. During training, RL is coupled with a dedicated reward generation pipeline that incentivizes more faithful reconstructions. This pipeline uses a generator model to reconstruct the full video from the selected summary frames; the similarity between the original and reconstructed video provides the reward signal. The generator itself is pre-trained in a self-supervised manner to reconstruct randomly masked frames. This two-stage training process improves stability compared to adversarial architectures. Experimental results show strong alignment with human judgments and promising F-scores, validating the reconstruction objective.
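The following is a minimal sketch, in PyTorch, of how such a reconstruction-based reward could be coupled with policy-gradient training. It assumes a summarizer that outputs per-frame selection probabilities in [0, 1] and a frozen, pre-trained generator that maps masked frame features back to full-length features; the names (`reconstruction_reward`, `reinforce_step`, `original_feats`) and interfaces are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def reconstruction_reward(original_feats, summary_mask, generator):
    """Reward = similarity between the full video and its reconstruction
    from the selected summary frames only (illustrative sketch)."""
    # Hide every frame that is not part of the summary.
    masked_input = original_feats * summary_mask.unsqueeze(-1)            # (T, D)
    reconstructed = generator(masked_input)                               # (T, D)
    # Mean per-frame cosine similarity between original and reconstruction.
    sims = F.cosine_similarity(original_feats, reconstructed, dim=-1)     # (T,)
    return sims.mean()

def reinforce_step(summarizer, generator, original_feats, optimizer, baseline=0.0):
    """One policy-gradient update: sample a summary from the summarizer's
    per-frame importance scores, score it with the frozen generator, and
    reinforce the selection probabilities accordingly."""
    scores = summarizer(original_feats).squeeze(-1)          # (T,) probabilities in [0, 1]
    dist = torch.distributions.Bernoulli(probs=scores)
    summary_mask = dist.sample()                             # stochastic frame selection
    with torch.no_grad():                                    # generator stays frozen
        reward = reconstruction_reward(original_feats, summary_mask, generator)
    # REINFORCE objective: maximize the expected reconstruction reward.
    loss = -(reward - baseline) * dist.log_prob(summary_mask).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward.item()
```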
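The generator's self-supervised pre-training stage (stage one of the two-stage process) could be sketched in the same spirit: randomly mask a fraction of the frames and train the generator to reconstruct the full feature sequence. Again, the masking scheme, the MSE objective, and the `mask_ratio` parameter are assumptions for illustration rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def pretrain_generator_step(generator, original_feats, optimizer, mask_ratio=0.5):
    """Stage-1 self-supervised pre-training (illustrative sketch): hide a random
    subset of frames and train the generator to reconstruct the full sequence."""
    T = original_feats.size(0)
    keep_mask = (torch.rand(T) > mask_ratio).float()          # 1 = frame visible
    masked_input = original_feats * keep_mask.unsqueeze(-1)   # (T, D)
    reconstructed = generator(masked_input)                   # (T, D)
    # Simple reconstruction objective on the frame features.
    loss = F.mse_loss(reconstructed, original_feats)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```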