In $\mathcal{X}$-armed bandit problem an agent sequentially interacts with environment which yields a reward based on the vector input the agent provides. The agent's goal is to maximise the sum of these rewards across some number of time steps. The problem and its variations have been a subject of numerous studies, suggesting sub-linear and some times optimal strategies. The given paper introduces a novel variation of the problem. We consider an environment, which can abruptly change its behaviour an unknown number of times. To that end we propose a novel strategy and prove it attains sub-linear cumulative regret. Moreover, in case of highly smooth relation between an action and the corresponding reward, the method is nearly optimal. The theoretical result are supported by experimental study.

### 相关内容

In order to model risk aversion in reinforcement learning, an emerging line of research adapts familiar algorithms to optimize coherent risk functionals, a class that includes conditional value-at-risk (CVaR). Because optimizing the coherent risk is difficult in Markov decision processes, recent work tends to focus on the Markov coherent risk (MCR), a time-consistent surrogate. While, policy gradient (PG) updates have been derived for this objective, it remains unclear (i) whether PG finds a global optimum for MCR; (ii) how to estimate the gradient in a tractable manner. In this paper, we demonstrate that, in general, MCR objectives (unlike the expected return) are not gradient dominated and that stationary points are not, in general, guaranteed to be globally optimal. Moreover, we present a tight upper bound on the suboptimality of the learned policy, characterizing its dependence on the nonlinearity of the objective and the degree of risk aversion. Addressing (ii), we propose a practical implementation of PG that uses state distribution reweighting to overcome previous limitations. Through experiments, we demonstrate that when the optimality gap is small, PG can learn risk-sensitive policies. However, we find that instances with large suboptimality gaps are abundant and easy to construct, outlining an important challenge for future research.

A standard assumption in contextual multi-arm bandit is that the true context is perfectly known before arm selection. Nonetheless, in many practical applications (e.g., cloud resource management), prior to arm selection, the context information can only be acquired by prediction subject to errors or adversarial modification. In this paper, we study a contextual bandit setting in which only imperfect context is available for arm selection while the true context is revealed at the end of each round. We propose two robust arm selection algorithms: MaxMinUCB (Maximize Minimum UCB) which maximizes the worst-case reward, and MinWD (Minimize Worst-case Degradation) which minimizes the worst-case regret. Importantly, we analyze the robustness of MaxMinUCB and MinWD by deriving both regret and reward bounds compared to an oracle that knows the true context. Our results show that as time goes on, MaxMinUCB and MinWD both perform as asymptotically well as their optimal counterparts that know the reward function. Finally, we apply MaxMinUCB and MinWD to online edge datacenter selection, and run synthetic simulations to validate our theoretical analysis.

Swarms of autonomous agents are useful in many applications due to their ability to accomplish tasks in a decentralized manner, making them more robust to failures. Due to the difficulty in running experiments with large numbers of hardware agents, researchers often make simplifying assumptions and remove constraints that might be present in a real swarm deployment. While simplifying away some constraints is tolerable, we feel that two in particular have been overlooked: one, that agents in a swarm take up physical space, and two, that agents might be damaged in collisions. Many existing works assume agents have negligible size or pass through each other with no added penalty. It seems possible to ignore these constraints using collision avoidance, but we show using an illustrative example that this is easier said than done. In particular, we show that collision avoidance can interfere with the intended swarming behavior and significant parameter tuning is necessary to ensure the behavior emerges as best as possible while collisions are avoided. We compare four different collision avoidance algorithms, two of which we consider to be the best decentralized collision avoidance algorithms available. Despite putting significant effort into tuning each algorithm to perform at its best, we believe our results show that further research is necessary to develop swarming behaviors that can achieve their goal while avoiding collisions with agents of non-negligible volume.

We consider the algorithmic problem of finding a near-optimal solution for the number partitioning problem (NPP). The NPP appears in many applications, including the design of randomized controlled trials, multiprocessor scheduling, and cryptography; and is also of theoretical significance. It possesses a so-called statistical-to-computational gap: when its input $X$ has distribution $\mathcal{N}(0,I_n)$, its optimal value is $\Theta(\sqrt{n}2^{-n})$ w.h.p.; whereas the best polynomial-time algorithm achieves an objective value of only $2^{-\Theta(\log^2 n)}$, w.h.p. In this paper, we initiate the study of the nature of this gap. Inspired by insights from statistical physics, we study the landscape of NPP and establish the presence of the Overlap Gap Property (OGP), an intricate geometric property which is known to be a rigorous evidence of an algorithmic hardness for large classes of algorithms. By leveraging the OGP, we establish that (a) any sufficiently stable algorithm, appropriately defined, fails to find a near-optimal solution with energy below $2^{-\omega(n \log^{-1/5} n)}$; and (b) a very natural MCMC dynamics fails to find near-optimal solutions. Our simulations suggest that the state of the art algorithm achieving $2^{-\Theta(\log^2 n)}$ is indeed stable, but formally verifying this is left as an open problem. OGP regards the overlap structure of $m-$tuples of solutions achieving a certain objective value. When $m$ is constant we prove the presence of OGP in the regime $2^{-\Theta(n)}$, and the absence of it in the regime $2^{-o(n)}$. Interestingly, though, by considering overlaps with growing values of $m$ we prove the presence of the OGP up to the level $2^{-\omega(\sqrt{n\log n})}$. Our proof of the failure of stable algorithms at values $2^{-\omega(n \log^{-1/5} n)}$ employs methods from Ramsey Theory from the extremal combinatorics, and is of independent interest.

In several applications of the stochastic multi-armed bandit problem, the traditional objective of maximizing the expected sum of rewards obtained can be inappropriate. Motivated by the problem of optimizing job assignments to train novice workers of unknown quality in labor platforms, we consider a new objective in the classical setup. Instead of maximizing the expected total reward from $T$ pulls, we consider the vector of cumulative rewards earned from the $K$ arms at the end of $T$ pulls, and aim to maximize the expected value of the highest cumulative reward across the $K$ arms. This corresponds to the objective of training a single, highly skilled worker using a limited supply of training jobs. For this new objective, we show that any policy must incur an instance-dependent asymptotic regret of $\Omega(\log T)$ (with a higher instance-dependent constant compared to the traditional objective) and an instance-independent regret of $\Omega(K^{1/3}T^{2/3})$. We then design an explore-then-commit policy, featuring exploration based on appropriately tuned confidence bounds on the mean reward and an adaptive stopping criterion, which adapts to the problem difficulty and achieves these bounds (up to logarithmic factors). Our numerical experiments demonstrate the efficacy of this policy compared to several natural alternatives in practical parameter regimes.

We consider the problem of computing homogeneous coordinates of points in a zero-dimensional subscheme of a compact, complex toric variety $X$. Our starting point is a homogeneous ideal $I$ in the Cox ring of $X$, which in practice might arise from homogenizing a sparse polynomial system. We prove a new eigenvalue theorem in the toric compact setting, which leads to a novel, robust numerical approach for solving this problem. Our method works in particular for systems having isolated solutions with arbitrary multiplicities. It depends on the multigraded regularity properties of $I$. We study these properties and provide bounds on the size of the matrices involved in our approach in the case where $I$ is a complete intersection.

In this paper, we consider online continuous DR-submodular maximization with linear stochastic long-term constraints. Compared to the prior work on online submodular maximization, our setting introduces the extra complication of stochastic linear constraint functions that are i.i.d. generated at each round. To be precise, at step $t\in\{1,\dots,T\}$, a DR-submodular utility function $f_t(\cdot)$ and a constraint vector $p_t$, i.i.d. generated from an unknown distribution with mean $p$, are revealed after committing to an action $x_t$ and we aim to maximize the overall utility while the expected cumulative resource consumption $\sum_{t=1}^T \langle p,x_t\rangle$ is below a fixed budget $B_T$. Stochastic long-term constraints arise naturally in applications where there is a limited budget or resource available and resource consumption at each step is governed by stochastically time-varying environments. We propose the Online Lagrangian Frank-Wolfe (OLFW) algorithm to solve this class of online problems. We analyze the performance of the OLFW algorithm and we obtain sub-linear regret bounds as well as sub-linear cumulative constraint violation bounds, both in expectation and with high probability.

The contextual combinatorial semi-bandit problem with linear payoff functions is a decision-making problem in which a learner chooses a set of arms with the feature vectors in each round under given constraints so as to maximize the sum of rewards of arms. Several existing algorithms have regret bounds that are optimal with respect to the number of rounds $T$. However, there is a gap of $\tilde{O}(\max(\sqrt{d}, \sqrt{k}))$ between the current best upper and lower bounds, where $d$ is the dimension of the feature vectors, $k$ is the number of the chosen arms in a round, and $\tilde{O}(\cdot)$ ignores the logarithmic factors. The dependence of $k$ and $d$ is of practical importance because $k$ may be larger than $T$ in real-world applications such as recommender systems. In this paper, we fill the gap by improving the upper and lower bounds. More precisely, we show that the C${}^2$UCB algorithm proposed by Qin, Chen, and Zhu (2014) has the optimal regret bound $\tilde{O}(d\sqrt{kT} + dk)$ for the partition matroid constraints. For general constraints, we propose an algorithm that modifies the reward estimates of arms in the C${}^2$UCB algorithm and demonstrate that it enjoys the optimal regret bound for a more general problem that can take into account other objectives simultaneously. We also show that our technique would be applicable to related problems. Numerical experiments support our theoretical results and considerations.

In this paper we propose a novel experimental design-based algorithm to minimize regret in online stochastic linear and combinatorial bandits. While existing literature tends to focus on optimism-based algorithms--which have been shown to be suboptimal in many cases--our approach carefully plans which action to take by balancing the tradeoff between information gain and reward, overcoming the failures of optimism. In addition, we leverage tools from the theory of suprema of empirical processes to obtain regret guarantees that scale with the Gaussian width of the action set, avoiding wasteful union bounds. We provide state-of-the-art finite time regret guarantees and show that our algorithm can be applied in both the bandit and semi-bandit feedback regime. In the combinatorial semi-bandit setting, we show that our algorithm is computationally efficient and relies only on calls to a linear maximization oracle. In addition, we show that with slight modification our algorithm can be used for pure exploration, obtaining state-of-the-art pure exploration guarantees in the semi-bandit setting. Finally, we provide, to the best of our knowledge, the first example where optimism fails in the semi-bandit regime, and show that in this setting our algorithm succeeds.

We consider the exploration-exploitation trade-off in reinforcement learning and we show that an agent imbued with a risk-seeking utility function is able to explore efficiently, as measured by regret. The parameter that controls how risk-seeking the agent is can be optimized exactly, or annealed according to a schedule. We call the resulting algorithm K-learning and show that the corresponding K-values are optimistic for the expected Q-values at each state-action pair. The K-values induce a natural Boltzmann exploration policy for which the `temperature' parameter is equal to the risk-seeking parameter. This policy achieves an expected regret bound of $\tilde O(L^{3/2} \sqrt{S A T})$, where $L$ is the time horizon, $S$ is the number of states, $A$ is the number of actions, and $T$ is the total number of elapsed time-steps. This bound is only a factor of $L$ larger than the established lower bound. K-learning can be interpreted as mirror descent in the policy space, and it is similar to other well-known methods in the literature, including Q-learning, soft-Q-learning, and maximum entropy policy gradient, and is closely related to optimism and count based exploration methods. K-learning is simple to implement, as it only requires adding a bonus to the reward at each state-action and then solving a Bellman equation. We conclude with a numerical example demonstrating that K-learning is competitive with other state-of-the-art algorithms in practice.

Audrey Huang,Liu Leqi,Zachary C. Lipton,Kamyar Azizzadenesheli
0+阅读 · 3月4日
Jianyi Yang,Shaolei Ren
0+阅读 · 3月4日
David Gamarnik,Eren C. Kızıldağ
0+阅读 · 3月2日
Eren Ozbay,Vijay Kamble
0+阅读 · 3月1日
Matías R. Bender,Simon Telen
0+阅读 · 3月1日
0+阅读 · 2月28日
Kei Takemura,Shinji Ito,Daisuke Hatano,Hanna Sumita,Takuro Fukunaga,Naonori Kakimura,Ken-ichi Kawarabayashi
0+阅读 · 2月27日
Andrew Wagenmaker,Julian Katz-Samuels,Kevin Jamieson
0+阅读 · 2月27日
Brendan O'Donoghue
3+阅读 · 2018年7月25日

29+阅读 · 2020年12月14日

20+阅读 · 2019年10月11日

45+阅读 · 2019年10月11日

18+阅读 · 2019年10月11日

37+阅读 · 2019年10月9日

CreateAMind
9+阅读 · 2019年5月24日
CreateAMind
9+阅读 · 2019年5月22日
CreateAMind
6+阅读 · 2019年1月26日
CreateAMind
4+阅读 · 2019年1月18日
CreateAMind
6+阅读 · 2019年1月7日
CreateAMind
8+阅读 · 2019年1月2日
CreateAMind
4+阅读 · 2018年12月17日

4+阅读 · 2017年11月2日
CreateAMind
9+阅读 · 2017年8月2日
CreateAMind
8+阅读 · 2017年7月21日
Top