Incrementality Bidding via Reinforcement Learning under Mixed and Delayed Rewards

Incrementality, which measures the causal effect of showing an ad to a potential customer (e.g., a user on an internet platform) versus not showing it, is a central object for advertisers in online advertising platforms. This paper investigates how an advertiser can learn to optimize the bidding sequence in an online manner \emph{without} knowing the incrementality parameters in advance. We formulate the offline version of this problem as a specially structured episodic Markov Decision Process (MDP) and then, for its online learning counterpart, propose a novel reinforcement learning (RL) algorithm with regret at most $\widetilde{O}(H^2\sqrt{T})$, which depends on the number of rounds $H$ and the number of episodes $T$, but does not depend on the number of actions (i.e., possible bids). A fundamental difference between our learning problem and standard RL problems is that the realized reward feedback from conversion incrementality is \emph{mixed} and \emph{delayed}. To handle this difficulty, we propose and analyze a novel pairwise moment-matching algorithm to learn the conversion incrementality, which we believe is of independent interest.
