Policy Gradient Algorithms for Age-of-Information Cost Minimization

Recent developments in cyber-physical systems have made it increasingly important to keep information about the physical environment fresh. However, optimizing the access policies of Internet of Things devices to maximize data freshness, measured as a function of the Age-of-Information (AoI) metric, is a challenging task. This work introduces two algorithms that optimize the information update process in cyber-physical systems operating under the generate-at-will model. Both find an online policy without knowing the characteristics of the transmission delay or the age cost function. The optimization minimizes the time-average cost, which combines the AoI at the receiver with the data transmission cost, making the approach suitable for a broad range of scenarios. Both algorithms employ policy gradient methods within the framework of model-free reinforcement learning (RL) and are specifically designed to handle continuous state and action spaces. Each algorithm minimizes the cost using a distinct strategy for deciding when to send an information update. Moreover, we demonstrate that the two strategies can be applied simultaneously, which reduces the cost further. The results show that the proposed algorithms exhibit good convergence properties and achieve a time-average cost within 3% of the optimal value, when the latter is computable. A comparison with other state-of-the-art methods shows that the proposed algorithms outperform them in one or more of the following aspects: applicability to a broader range of scenarios, a lower time-average cost, and a computational cost at least one order of magnitude lower.
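To make the objective concrete, the following is a minimal sketch of how a time-average cost could be computed under the generate-at-will model. It is an illustration under stated assumptions, not the algorithms proposed in this work: the age cost is assumed to be linear (cost rate equal to the AoI), each update is assumed to be generated at the moment it is sent, and the function name `time_average_cost`, its parameters, and the fixed per-update charge `tx_cost` are hypothetical.

```python
def time_average_cost(delays, waits, tx_cost=0.0):
    """Time-average cost for a generate-at-will source (illustrative sketch).

    Assumptions (not taken from the paper):
      * the age cost rate equals the AoI itself (linear age cost);
      * update i is generated when it is sent and arrives delays[i] later;
      * after update i arrives, the source waits waits[i] before sending
        update i+1;
      * every delivered update after the first is charged tx_cost.
    """
    if len(delays) != len(waits) + 1:
        raise ValueError("need exactly one more delay than waiting times")

    total_cost = 0.0
    total_time = 0.0
    for y_prev, z, y_next in zip(delays, waits, delays[1:]):
        # At the previous delivery the AoI equals that update's delay.
        start_age = y_prev
        # The AoI then grows linearly until the next delivery.
        end_age = y_prev + z + y_next
        # Area under the linear age curve over this inter-delivery interval.
        total_cost += 0.5 * (end_age**2 - start_age**2) + tx_cost
        total_time += z + y_next
    return total_cost / total_time


if __name__ == "__main__":
    # Unit delays, zero waiting versus waiting one time unit, with tx_cost = 1.
    print(time_average_cost([1.0, 1.0, 1.0], [0.0, 0.0], tx_cost=1.0))
    print(time_average_cost([1.0, 1.0, 1.0], [1.0, 1.0], tx_cost=1.0))
```

In this toy setting with unit delays and `tx_cost = 1.0`, sending immediately and waiting one time unit give the same average cost, because the extra age accumulated while waiting offsets the saving from transmitting less often. A learned policy would choose the waiting times to balance these two terms without knowing the delay distribution or the age cost function in advance.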
