Deep Q-learning algorithms remain notoriously unstable, especially during early training when the maximization operator amplifies estimation errors. Inspired by bounded rationality theory and developmental learning, we introduce Sat-EnQ, a two-phase framework that first learns to be ``good enough'' before optimizing aggressively. In Phase 1, we train an ensemble of lightweight Q-networks under a satisficing objective that limits early value growth using a dynamic baseline, producing diverse, low-variance estimates while avoiding catastrophic overestimation. In Phase 2, the ensemble is distilled into a larger network and fine-tuned with standard Double DQN. We prove theoretically that satisficing induces bounded updates and cannot increase target variance, with a corollary quantifying conditions for substantial reduction. Empirically, Sat-EnQ achieves 3.8x variance reduction, eliminates catastrophic failures (0% vs 50% for DQN), maintains 79% performance under environmental noise, and requires 2.5x less compute than bootstrapped ensembles. Our results highlight a principled path toward robust reinforcement learning by embracing satisficing before optimization.
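The following is a minimal PyTorch sketch of the Phase 1 idea, assuming the satisficing objective is realized as a cap on the bootstrapped TD target; the network size, the `aspiration` parameter, and its update rule are illustrative assumptions rather than the paper's exact formulation.

```python
# Illustrative sketch only: the cap-based satisficing target and the
# aspiration level below are assumptions, not the paper's exact method.
import torch
import torch.nn as nn


class SmallQNet(nn.Module):
    """Lightweight Q-network, one member of the Phase 1 ensemble."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)


def satisficing_target(q_target: SmallQNet,
                       reward: torch.Tensor,
                       next_obs: torch.Tensor,
                       done: torch.Tensor,
                       aspiration: float,
                       gamma: float = 0.99) -> torch.Tensor:
    """Bootstrapped TD target capped at a dynamic aspiration level,
    limiting early value growth instead of chasing the raw max."""
    with torch.no_grad():
        next_q = q_target(next_obs).max(dim=1).values
        raw = reward + gamma * (1.0 - done) * next_q
        # Satisficing: the target never exceeds the current aspiration.
        return torch.minimum(raw, torch.full_like(raw, aspiration))
```

Under this sketch, each ensemble member would regress toward such a capped target while the aspiration level (e.g., a running statistic of observed returns) is raised gradually, before the distilled network switches to standard Double DQN fine-tuning in Phase 2.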