Causal Reinforcement Learning: An Instrumental Variable Approach

In the standard data analysis framework, data is first collected (once and for all), and then data analysis is carried out. With the advancement of digital technology, decision makers constantly analyze past data and generate new data through the decisions they make. In this paper, we model this as a Markov decision process and show that the dynamic interaction between data generation and data analysis leads to a new type of bias -- reinforcement bias -- that exacerbates the endogeneity problem in standard data analysis. We propose a class of instrumental variable (IV)-based reinforcement learning (RL) algorithms to correct for the bias and establish their asymptotic properties by incorporating them into a two-timescale stochastic approximation framework. A key contribution of the paper is the development of new techniques that allow for the analysis of the algorithms in general settings where the noise is time-dependent. We use these techniques to derive sharper results on finite-time trajectory stability bounds: with a polynomial rate, the entire future trajectory of the iterates from the algorithm falls within a ball that is centered at the true parameter and is shrinking at a (different) polynomial rate. We also use these techniques to provide formulas for inference, which is rarely done for RL algorithms. These formulas highlight how the strength of the IV and the degree of the noise's time dependency affect the inference.
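To make the endogeneity problem and the role of an instrument concrete, here is a minimal static sketch in Python. It is not the paper's RL algorithm and involves no Markov decision process or two-timescale recursion. It only illustrates, on simulated data, how an unobserved confounder biases an ordinary least squares slope and how a simple IV (ratio-of-covariances) estimator removes that bias. The data-generating process and the names `simulate`, `ols_slope`, `iv_slope`, and `BETA` are all assumptions made for illustration.

```python
import numpy as np


def simulate(n, beta, seed=0):
    """Hypothetical data-generating process (an assumption, not from the paper).

    z : instrument, independent of the confounder u
    u : unobserved confounder that enters both x and y
    x : endogenous regressor (correlated with the error term through u)
    y : outcome with true slope `beta` on x
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n)
    u = rng.standard_normal(n)
    x = z + u + rng.standard_normal(n)
    y = beta * x + u + rng.standard_normal(n)
    return z, x, y


def ols_slope(x, y):
    """Least squares slope of y on x; biased here because x depends on u."""
    xc = x - x.mean()
    yc = y - y.mean()
    return float(xc @ yc / (xc @ xc))


def iv_slope(z, x, y):
    """Simple IV estimator: sample cov(z, y) divided by sample cov(z, x)."""
    zc = z - z.mean()
    return float(zc @ (y - y.mean()) / (zc @ (x - x.mean())))


BETA = 2.0  # true slope used in the simulation (illustrative value)
z, x, y = simulate(100_000, BETA)
beta_ols = ols_slope(x, y)
beta_iv = iv_slope(z, x, y)
print(f"true={BETA:.3f}  ols={beta_ols:.3f}  iv={beta_iv:.3f}")
```

Under this assumed setup the population least squares slope is 2 + 1/3, because the confounder contributes a covariance of 1 against a regressor variance of 3, while the IV ratio recovers 2. A weaker instrument (a smaller coefficient on `z` in `x`) would make the IV estimate noisier, which is in the spirit of the abstract's remark that IV strength affects inference.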
