Two-Stage Constrained Actor-Critic for Short Video Recommendation

The wide popularity of short videos on social media poses new opportunities and challenges for optimizing recommender systems on video-sharing platforms. Users interact with the system sequentially and provide complex, multi-faceted responses, including watch time and various types of interactions with multiple videos. On the one hand, the platform aims to optimize users' cumulative watch time (the main goal) in the long term, which can be effectively optimized by Reinforcement Learning. On the other hand, the platform also needs to satisfy the constraint of accommodating the responses of multiple user interactions (auxiliary goals) such as like, follow, and share. In this paper, we formulate the problem of short video recommendation as a Constrained Markov Decision Process (CMDP). We find that traditional constrained reinforcement learning algorithms do not work well in this setting. We propose a novel two-stage constrained actor-critic method: at stage one, we learn individual policies to optimize each auxiliary signal. At stage two, we learn a policy to (i) optimize the main signal and (ii) stay close to the policies learned at the first stage, which effectively guarantees the performance of this main policy on the auxiliaries. Through extensive offline evaluations, we demonstrate the effectiveness of our method over alternatives in both optimizing the main goal and balancing the others. We further show the advantage of our method in live experiments on short video recommendation, where it significantly outperforms other baselines in terms of both watch time and interactions. Our approach has been fully launched in the production system to optimize user experiences on the platform.
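To make the stage-two idea concrete, the following is a minimal sketch, not the paper's implementation. It assumes a single state, a small discrete action set, and that "staying close" is measured by a weighted KL penalty toward each stage-one policy; the abstract does not specify any of these. Under those assumptions, maximizing sum_a pi(a) A(a) - sum_i w_i KL(pi || pi_i) has the closed-form solution pi(a) proportional to exp(A(a)/W) times the product of pi_i(a)^(w_i/W), where W is the sum of the weights. The function name `stage_two_policy`, the weights, and all numbers are hypothetical.

```python
import math


def stage_two_policy(main_advantage, aux_policies, weights):
    """Maximizer of  sum_a pi(a) * A(a) - sum_i w_i * KL(pi || pi_i)
    over distributions pi on a finite action set (single state).

    Assumes every auxiliary policy gives each action nonzero probability.
    """
    total_weight = sum(weights)
    # Unnormalized log-probabilities: (A(a) + sum_i w_i * log pi_i(a)) / W
    logits = []
    for action, advantage in enumerate(main_advantage):
        aux_term = sum(
            w * math.log(policy[action])
            for w, policy in zip(weights, aux_policies)
        )
        logits.append((advantage + aux_term) / total_weight)
    # Numerically stable softmax over the logits.
    peak = max(logits)
    unnormalized = [math.exp(x - peak) for x in logits]
    norm = sum(unnormalized)
    return [u / norm for u in unnormalized]


# Hypothetical inputs: three candidate videos, one main signal, two auxiliaries.
main_advantage = [1.0, 0.0, -1.0]   # assumed watch-time advantage per action
like_policy = [0.2, 0.6, 0.2]       # assumed stage-one policy for "like"
share_policy = [0.1, 0.3, 0.6]      # assumed stage-one policy for "share"
aux = [like_policy, share_policy]

# Small weights: the main signal dominates.
loose = stage_two_policy(main_advantage, aux, [0.05, 0.05])
# Large weights: the result stays near a blend of the auxiliary policies.
tight = stage_two_policy(main_advantage, aux, [50.0, 50.0])

print("loose:", [round(p, 4) for p in loose])
print("tight:", [round(p, 4) for p in tight])
```

With small weights the policy concentrates on the action with the highest main-signal advantage, while with large weights it follows the auxiliary policies and prefers the action they jointly favor. The weights therefore act as the knob that trades watch time against interactions in this toy setting.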
