Contextual bandits are a widely used technique in applications such as personalization, recommendation systems, mobile health, and causal marketing. As a dynamic approach, they can be more efficient than standard A/B testing at minimizing regret. We propose an end-to-end automated meta-learning pipeline to approximate the optimal Q-function for contextual bandit problems. Our model performs substantially better than random exploration: it is more regret-efficient and converges with a limited number of samples, while remaining general and easy to use thanks to the meta-learning approach. We use a linearly annealed ε-greedy exploration policy to define the exploration-versus-exploitation schedule. We test the system on a synthetic environment to characterize it fully, and we evaluate it on several open-source datasets to benchmark against prior work. Our model outperforms or performs comparably to other models while requiring neither tuning nor feature engineering.
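For concreteness, a minimal sketch of the linearly annealed ε-greedy exploration policy mentioned above follows. The schedule parameters (eps_start, eps_end, anneal_steps) and the per-arm Q-value estimates are illustrative assumptions, not values taken from this work.

```python
import numpy as np

def epsilon_schedule(step, eps_start=1.0, eps_end=0.05, anneal_steps=10_000):
    """Linearly anneal epsilon from eps_start to eps_end over anneal_steps."""
    frac = min(step / anneal_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def select_action(q_values, step, rng):
    """Epsilon-greedy choice over per-arm Q estimates for the current context."""
    eps = epsilon_schedule(step)
    if rng.random() < eps:
        return int(rng.integers(len(q_values)))  # explore: uniformly random arm
    return int(np.argmax(q_values))              # exploit: greedy arm

# Usage: q_values would come from the learned Q-function evaluated on a context.
rng = np.random.default_rng(0)
q_values = np.array([0.2, 0.8, 0.5])  # hypothetical Q(context, arm) estimates
action = select_action(q_values, step=500, rng=rng)
```

Early in training epsilon stays near eps_start, favoring exploration; as the step count grows it decays linearly toward eps_end, shifting the policy toward exploiting the learned Q-function.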