We consider an online two-stage stochastic optimization problem with long-term constraints over a finite horizon of $T$ periods. At each period, we take a first-stage action, observe a realization of the model parameter, and then take a second-stage action from a feasible set that depends on both the first-stage decision and the model parameter. We aim to minimize the cumulative objective value while guaranteeing that the long-term average of the second-stage decisions belongs to a given set. We propose a general algorithmic framework that derives online algorithms for the online two-stage problem from adversarial learning algorithms, and the regret bound of our algorithm can be reduced to the regret bound of the embedded adversarial learning algorithm. Based on this framework, we obtain new results under various settings. When the model parameters at each period are drawn from identical distributions, we derive a state-of-the-art regret bound that improves upon previous bounds in special cases. Our algorithm is also robust to adversarial corruptions of the model parameter realizations. When the model parameters are drawn from unknown non-stationary distributions and we are given prior estimates of these distributions, we develop from our framework a new algorithm with regret $O(W_T+\sqrt{T})$, where $W_T$ measures the total inaccuracy of the prior estimates.