Adaptive traffic signal control is a key avenue for mitigating the growing consequences of traffic congestion. Incumbent solutions such as SCOOT and SCATS require regular, time-consuming calibration, cannot optimise well across multiple road-use modalities, and depend on the manual curation of many implementation plans. A recent alternative to these approaches is deep reinforcement learning, in which an agent learns to take the most appropriate action for a given state of the system. Learning is guided by a reward function, approximated via neural networks, that provides feedback to the agent on the performance of the actions taken; the resulting behaviour is therefore sensitive to the specific reward function chosen. Several authors have surveyed the reward functions used in the literature, but attributing outcome differences to reward function choice across works is problematic, as there are many uncontrolled differences between studies as well as differing outcome metrics. This paper compares the performance of agents trained with different reward functions in a simulation of a junction in Greater Manchester, UK, across various demand profiles, subject to real-world constraints: realistic sensor inputs, controllers, calibrated demand, intergreen times and stage sequencing. The reward metrics considered are based on time spent stopped, lost time, change in lost time, average speed, queue length, junction throughput and variations of these quantities. The performance of these reward functions is compared in terms of total waiting time. We find that speed maximisation results in the lowest average waiting times across all demand levels, performing significantly better than other rewards previously introduced in the literature.
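To make the compared reward families concrete, the following is a minimal sketch of how such metrics might be expressed as scalar rewards from per-interval junction measurements. All function names and input conventions here are illustrative assumptions, not the paper's implementation.

```python
# Illustrative reward functions for traffic signal control (hypothetical
# names and inputs; the paper's exact definitions may differ).

def reward_stopped_time(stopped_times):
    # Negative total time (s) vehicles spent stopped in the last interval.
    return -sum(stopped_times)

def reward_avg_speed(speeds):
    # Mean speed (m/s) of vehicles on approach lanes (speed maximisation).
    return sum(speeds) / len(speeds) if speeds else 0.0

def reward_queue_length(queues):
    # Negative total queue length (vehicles) across all approaches.
    return -sum(queues)

def reward_throughput(vehicles_cleared):
    # Count of vehicles that crossed the stop line this interval.
    return vehicles_cleared

def reward_delta_lost_time(lost_time_now, lost_time_prev):
    # Change in cumulative lost time between decision points:
    # positive when lost time decreased, negative when it grew.
    return lost_time_prev - lost_time_now
```

Each function maps sensor-derived measurements to a scalar the agent maximises; the paper's comparison concerns which of these signals, when optimised, yields the lowest total waiting time.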