Model-Based Reinforcement Learning for Whole-Chain Recommendations

With the recent prevalence of Reinforcement Learning (RL), there has been tremendous interest in developing RL-based recommender systems. In practical recommendation sessions, users sequentially access multiple scenarios, such as entrance pages and item detail pages, and each scenario has its own recommendation strategy. However, the majority of existing RL-based recommender systems optimize each strategy separately, which can lead to sub-optimal overall performance, because independently optimizing each scenario (i) overlooks the sequential correlation among scenarios, (ii) ignores users' behavior data from other scenarios, and (iii) optimizes only its own objective while neglecting the overall objective of a session. Therefore, in this paper, we study the recommendation problem with multiple (consecutive) scenarios, i.e., whole-chain recommendations. We propose a multi-agent reinforcement-learning-based approach (DeepChain), which can capture the sequential correlation among different scenarios and jointly optimize multiple recommendation strategies. Specifically, all recommender agents share the same memory of users' historical behaviors, and they work collaboratively to maximize the overall reward of a session. Note that jointly optimizing multiple recommendation strategies faces two challenges: (i) it requires huge amounts of user behavior data, and (ii) the distribution of rewards (users' feedback) is extremely unbalanced. In this paper, we introduce model-based reinforcement learning techniques to reduce the training data requirement and execute more accurate strategy updates. Experimental results based on data from a real e-commerce platform demonstrate the effectiveness of the proposed framework. Further experiments have been conducted to validate the importance of each component of DeepChain.
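To make the shared-memory and session-level objective concrete, the following is a minimal illustrative sketch and not the DeepChain implementation. The class `SharedMemory`, the function `session_return`, the scenario names, the reward values, and the discount factor `gamma` are all hypothetical choices made for this example; the abstract does not specify them.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical event record: (scenario, item_id, reward).
Event = Tuple[str, str, float]


@dataclass
class SharedMemory:
    """Behavior history that every scenario agent reads and writes (assumed structure)."""
    events: List[Event] = field(default_factory=list)

    def add(self, scenario: str, item_id: str, reward: float) -> None:
        self.events.append((scenario, item_id, reward))

    def state(self, window: int = 5) -> Tuple[Event, ...]:
        # The most recent events, regardless of which scenario produced them.
        return tuple(self.events[-window:])


def session_return(rewards: List[float], gamma: float = 0.95) -> float:
    """Discounted sum of rewards over the whole session, across all scenarios."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


memory = SharedMemory()
memory.add("entrance_page", "item_17", 0.0)     # skipped (assumed reward value)
memory.add("entrance_page", "item_42", 1.0)     # clicked (assumed reward value)
memory.add("item_detail_page", "item_42", 5.0)  # purchased (assumed reward value)

# One objective for the whole session instead of one objective per scenario.
overall = session_return([reward for _, _, reward in memory.events])
```

In this sketch, an agent serving the item detail page would build its state from `memory.state()`, which also contains the entrance-page events, and both agents would be evaluated against the single value `overall` rather than against per-scenario rewards.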
