Contemporary autopilot systems for unmanned aerial vehicles (UAVs) are far more limited in their flight envelope than experienced human pilots, restricting the conditions in which UAVs can operate and the types of missions they can accomplish autonomously. This paper proposes a deep reinforcement learning (DRL) controller to handle the nonlinear attitude control problem, enabling extended flight envelopes for fixed-wing UAVs. A proof-of-concept controller using the proximal policy optimization (PPO) algorithm is developed, and is shown to be capable of stabilizing a fixed-wing UAV from a large set of initial conditions to reference roll, pitch, and airspeed values. The training process is outlined and key factors for its progression rate are considered, the most important being to limit the number of variables in the observation vector while including values from several previous time steps for these variables. The trained reinforcement learning (RL) controller is compared to a proportional-integral-derivative (PID) controller, and is found to converge in more cases than the PID controller, with comparable performance. Furthermore, the RL controller is shown to generalize well to unseen disturbances in the form of wind and turbulence, even in severe disturbance conditions.