In this work we discuss the incorporation of quadratic neurons into policy networks in the context of model-free actor-critic reinforcement learning. Quadratic neurons admit an explicit quadratic function approximation, in contrast to conventional approaches where the non-linearity is induced solely by the activation functions. We perform empirical experiments on several MuJoCo continuous control tasks and find that MLP policy networks augmented with quadratic neurons outperform the baseline MLP while using fewer parameters. The top return is increased by $5.8\%$ on average, with about $21\%$ higher sample efficiency. Moreover, the advantage is maintained under added action and observation noise.
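To make the contrast concrete, the following is a minimal sketch of a single quadratic neuron, assuming the common full quadratic form $y = x^\top W x + w^\top x + b$; the paper's exact parameterization may differ, and the names here are illustrative only.

```python
import numpy as np

def quadratic_neuron(x, W, w, b):
    """Quadratic pre-activation: x^T W x + w^T x + b.

    Unlike a conventional neuron (w^T x + b followed by a nonlinear
    activation), the quadratic term x^T W x makes the neuron itself
    a second-order function approximator.
    """
    return x @ W @ x + w @ x + b

# Toy example on a 2-dimensional input.
x = np.array([1.0, 2.0])
W = np.eye(2)                  # quadratic weights (hypothetical values)
w = np.array([0.5, -0.5])      # linear weights
b = 1.0
y = quadratic_neuron(x, W, w, b)
# x^T W x = 1 + 4 = 5;  w^T x = 0.5 - 1.0 = -0.5;  b = 1  ->  y = 5.5
```

In a policy network, such neurons would replace (or be mixed with) ordinary affine units in the hidden layers; the quadratic term can capture curvature directly, which is one plausible reason a smaller network suffices.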