Sequential Stochastic Optimization in Separable Learning Environments

We consider a class of sequential decision-making problems under uncertainty that can encompass various types of supervised learning concepts. These problems have a completely observed state process and a partially observed modulation process. The state process is affected by the modulation process only through an observation process, the observation process observes only the modulation process, and the modulation process is exogenous to control. We model this broad class of problems as a partially observed Markov decision process (POMDP). The belief function for the modulation process is control invariant, which separates the estimation of the modulation process from the control of the state process. We call this specially structured POMDP the separable POMDP, or SEP-POMDP, and show that it (i) can serve as a model for a broad class of application areas, e.g., inventory control, finance, and healthcare systems; (ii) inherits value function and optimal policy structure from a set of completely observed MDPs; (iii) can serve as a bridge between classical models of sequential decision making under uncertainty, which have fully specified model artifacts, and models that are not fully specified and require the use of predictive methods from statistics and machine learning; and (iv) allows for specialized approximate solution procedures.
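The control-invariant belief described above can be illustrated with a minimal sketch, assuming a finite-state modulation process and a finite observation alphabet. The names `update_belief`, `filter_beliefs`, `P`, and `Q`, the timing convention, and the numbers in the usage lines are hypothetical choices made for illustration and are not taken from the paper. The point of the sketch is that the update takes no action argument, so the belief trajectory can be computed from the observations alone, independently of how the state process is controlled.

```python
from typing import List, Sequence


def update_belief(
    belief: Sequence[float],
    P: Sequence[Sequence[float]],
    Q: Sequence[Sequence[float]],
    z: int,
) -> List[float]:
    """One Bayes-filter step for the modulation process (illustrative).

    belief[i] : Pr(modulation state = i | observations so far)
    P[i][j]   : Pr(next modulation state = j | current = i); assumed
                exogenous, so it does not depend on the action
    Q[j][z]   : Pr(observation = z | next modulation state = j); the
                observation is assumed to depend only on the modulation state
    """
    n = len(belief)
    # Predict the next modulation state, then weight by the observation likelihood.
    unnormalized = [
        sum(belief[i] * P[i][j] for i in range(n)) * Q[j][z]
        for j in range(n)
    ]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under the current belief")
    return [u / total for u in unnormalized]


def filter_beliefs(
    prior: Sequence[float],
    P: Sequence[Sequence[float]],
    Q: Sequence[Sequence[float]],
    observations: Sequence[int],
) -> List[List[float]]:
    """Belief trajectory; note that no action sequence is required."""
    beliefs = [list(prior)]
    for z in observations:
        beliefs.append(update_belief(beliefs[-1], P, Q, z))
    return beliefs


# Hypothetical two-state example (numbers are made up for illustration).
P_example = [[0.9, 0.1], [0.2, 0.8]]
Q_example = [[0.7, 0.3], [0.4, 0.6]]
trajectory = filter_beliefs([0.5, 0.5], P_example, Q_example, [0, 1, 1])
```

Because the trajectory depends only on the prior and the observations, a controller for the state process could, under these assumptions, treat the belief as an exogenous input rather than something its actions influence.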
