We consider the problem of service placement at the network edge, in which a decision maker must choose among $N$ services to host at the edge to satisfy the demands of end users. Our goal is to design adaptive algorithms that minimize the average service delivery latency for users. We pose the problem as a Markov decision process (MDP) whose system state records, for each service, the number of users currently waiting at the edge to obtain that service. However, solving this $N$-service MDP is computationally expensive due to the curse of dimensionality. To overcome this challenge, we show that the optimal policy for a single-service MDP has an appealing threshold structure, and, using the theory of Whittle index policies, we explicitly derive the Whittle index of each service as a function of the number of pending requests from end users. Since request arrival and service delivery rates are usually unknown and possibly time-varying, we then develop efficient learning-augmented algorithms that fully exploit the structure of the optimal policies while incurring low learning regret. The first, UCB-Whittle, relies on the principle of optimism in the face of uncertainty. The second, Q-learning-Whittle, runs Q-learning iterations for each service via a two-timescale stochastic approximation. We characterize the non-asymptotic performance of UCB-Whittle by analyzing its learning regret, and we also analyze the convergence properties of Q-learning-Whittle. Simulation results show that the proposed policies yield excellent empirical performance.
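To make the index-based decision rule concrete, the following is a minimal sketch (not the paper's implementation) of how a Whittle index policy selects which services to host: each service has an index function of its current backlog, and the policy hosts the services with the largest indices. The function names, the linear index form, and the per-service rates below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def whittle_index_policy(index_fns, state, budget):
    """Select the `budget` services with the largest Whittle indices.

    index_fns: per-service functions; index_fns[i](n) returns the
               (hypothetical) Whittle index of service i when n
               requests are waiting at the edge.
    state:     current number of waiting requests for each service.
    budget:    number of services the edge node can host.
    """
    values = np.array([index_fns[i](n) for i, n in enumerate(state)])
    # Host the services whose current indices are largest.
    return set(np.argsort(values)[-budget:].tolist())

# Toy example: indices increasing in the backlog, which is consistent
# with a threshold-structured optimal policy; the linear form is
# purely illustrative.
idx_fns = [lambda n, r=r: r * n for r in (0.5, 1.0, 2.0)]
chosen = whittle_index_policy(idx_fns, state=[4, 3, 2], budget=2)
# Services 1 and 2 have the largest indices (3.0 and 4.0) and are hosted.
```

The appeal of this decomposition is that each index function depends only on its own service's state, so the $N$-service problem reduces to $N$ one-dimensional computations plus a sort.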