Learning Augmented Index Policy for Optimal Service Placement at the Network Edge

We consider the problem of service placement at the network edge, in which a decision maker must choose which of $N$ services to host at the edge to satisfy customer demand. Our goal is to design adaptive algorithms that minimize the average service delivery latency for customers. We pose the problem as a Markov decision process (MDP) whose state records, for each service, the number of customers currently waiting at the edge to obtain that service. Solving this $N$-service MDP is computationally expensive due to the curse of dimensionality. To overcome this challenge, we show that the optimal policy for a single-service MDP has an appealing threshold structure and, building on the theory of the Whittle index policy, explicitly derive the Whittle index of each service as a function of its number of customer requests. Since request arrival rates and service delivery rates are usually unknown and possibly time-varying, we then develop efficient learning-augmented algorithms that exploit the structure of the optimal policy and incur low learning regret. The first, UCB-Whittle, relies on the principle of optimism in the face of uncertainty. The second, Q-learning-Whittle, runs Q-learning iterations for each service using two-timescale stochastic approximation. We characterize the non-asymptotic performance of UCB-Whittle by analyzing its learning regret, and we analyze the convergence properties of Q-learning-Whittle. Simulation results show that the proposed policies yield excellent empirical performance.
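To make the index-policy idea concrete, the following is a minimal sketch of how such a rule could pick services once per-service indices are available. It is not the authors' implementation: the function names `select_services` and `toy_index`, the hosting capacity `capacity`, and the linear placeholder index are all assumptions made for illustration. The closed-form Whittle index derived in the paper is not reproduced in this abstract and is not used here.

```python
from typing import Callable, List, Sequence


def select_services(
    queue_lengths: Sequence[int],
    index_fns: Sequence[Callable[[int], float]],
    capacity: int,
) -> List[int]:
    """Return the ids of the `capacity` services with the largest index values.

    queue_lengths[i] is the number of customers waiting for service i, and
    index_fns[i] maps that count to the index of service i.
    `capacity` (how many services fit at the edge) is an assumed parameter.
    """
    if len(queue_lengths) != len(index_fns):
        raise ValueError("need exactly one index function per service")
    scores = [fn(q) for fn, q in zip(index_fns, queue_lengths)]
    # Rank by descending index; break ties by the smaller service id.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(ranked[:capacity])


def toy_index(weight: float) -> Callable[[int], float]:
    """Hypothetical placeholder index, increasing in the number of waiting
    customers. This is NOT the Whittle index derived in the paper."""
    return lambda waiting: weight * waiting


index_fns = [toy_index(1.0), toy_index(0.5), toy_index(2.0)]
hosted = select_services([3, 10, 1], index_fns, capacity=2)
print(hosted)  # prints [0, 1]
```

Because each index depends only on that service's own queue length, the decision costs one index evaluation per service plus a sort, rather than a search over the joint $N$-service state space. A learning variant would replace the fixed index functions with ones computed from estimated arrival and delivery rates.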
