Fitting Reinforcement Learning Model to Behavioral Data under Bandits

We consider the problem of fitting a reinforcement learning (RL) model to given behavioral data in a multi-armed bandit environment. Such models have received much attention in recent years for characterizing human and animal decision-making behavior. We provide a generic mathematical optimization formulation of the fitting problem for a wide range of RL models that appear frequently in scientific research applications, followed by a detailed theoretical analysis of its convexity properties. Based on these theoretical results, we introduce a novel solution method for the fitting problem, built on convex relaxation and optimization. We evaluate our method in several simulated bandit environments and compare it with benchmark methods from the literature. Numerical results indicate that our method achieves performance comparable to state-of-the-art methods while significantly reducing computation time. We also provide an open-source Python package for the proposed method, so that researchers can apply it directly to their own datasets without prior knowledge of convex optimization.
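To make the fitting problem concrete, here is a minimal sketch of one common instance of it. It is not the convex-relaxation method described above and does not use the authors' package. It assumes a standard Q-learning agent with a softmax choice rule (learning rate `alpha`, inverse temperature `beta`), a two-armed Bernoulli bandit, and brute-force grid search over the negative log-likelihood. All function names and parameter values are hypothetical and chosen only for illustration.

```python
import math
import random


def simulate(alpha, beta, reward_probs, n_trials, rng):
    """Simulate a softmax Q-learning agent on a Bernoulli bandit (assumed model)."""
    q = [0.0] * len(reward_probs)
    actions, rewards = [], []
    for _ in range(n_trials):
        # Softmax choice: P(a) is proportional to exp(beta * Q[a]).
        weights = [math.exp(beta * v) for v in q]
        a = rng.choices(range(len(q)), weights=weights)[0]
        r = 1.0 if rng.random() < reward_probs[a] else 0.0
        # Delta-rule update of the chosen arm's value.
        q[a] += alpha * (r - q[a])
        actions.append(a)
        rewards.append(r)
    return actions, rewards


def negative_log_likelihood(alpha, beta, actions, rewards, n_arms):
    """Negative log-likelihood of the observed choices under the assumed model."""
    q = [0.0] * n_arms
    nll = 0.0
    for a, r in zip(actions, rewards):
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(beta * v for v in q)
        log_z = m + math.log(sum(math.exp(beta * v - m) for v in q))
        nll -= beta * q[a] - log_z
        q[a] += alpha * (r - q[a])
    return nll


def fit_grid(actions, rewards, n_arms, alphas, betas):
    """Return (nll, alpha, beta) minimizing the NLL over a parameter grid."""
    return min(
        (negative_log_likelihood(a, b, actions, rewards, n_arms), a, b)
        for a in alphas
        for b in betas
    )


# Illustrative run with made-up ground-truth parameters.
rng = random.Random(0)
actions, rewards = simulate(0.3, 3.0, [0.2, 0.8], 200, rng)
alphas = [0.1 * i for i in range(1, 10)]
betas = [0.5 * i for i in range(1, 11)]
best_nll, best_alpha, best_beta = fit_grid(actions, rewards, 2, alphas, betas)
print(best_nll, best_alpha, best_beta)
```

Grid search scales poorly as the number of parameters grows, and its accuracy is limited by the grid resolution. The sketch only shows what is being optimized: a likelihood of observed choices given the rewards the agent received.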
