Alignment of Large Language Models (LLMs) typically relies on Reinforcement Learning from Human Feedback (RLHF) with gradient-based optimizers such as Proximal Policy Optimization (PPO) or Group Relative Policy Optimization (GRPO). While effective, these methods require complex distributed training, large memory budgets, and careful hyperparameter tuning, all of which become increasingly difficult at billion-parameter scale. We present ESSA (Evolutionary Strategies for Scalable Alignment), a gradient-free framework that aligns LLMs using only forward inference and black-box optimization. ESSA restricts optimization to Low-Rank Adapters (LoRA) and further compresses the parameter space by optimizing only the singular values obtained from a singular value decomposition (SVD) of each adapter matrix. This dimensionality reduction makes evolutionary search practical even for very large models and enables efficient operation in quantized INT4 and INT8 inference modes. Across math-reasoning and instruction-following benchmarks, ESSA improves the test accuracy of Qwen2.5-Math-7B by 12.6% on GSM8K and 14.8% on PRM800K, and raises the accuracy of LLaMA3.1-8B on IFEval by 22.5%, all compared with GRPO. In large-scale settings ESSA scales better than gradient-based methods: on PRM800K with Qwen2.5-32B, it reaches near-optimal accuracy twice as fast on 16 GPUs and six times as fast on 128 GPUs compared with GRPO. These results position evolutionary strategies as a compelling, hardware-friendly alternative to gradient-based LLM alignment, combining competitive quality with substantially reduced wall-clock time and engineering overhead.
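To make the parameterization concrete, the sketch below illustrates the general idea in Python: a LoRA update is decomposed via SVD, the singular vectors are frozen, and a simple evolution strategy with mirrored Gaussian perturbations searches over the singular values alone. The dimensions, reward function, ES variant, and hyperparameters are illustrative assumptions, not the authors' implementation; in ESSA the objective would be a forward-only evaluation of the adapted model (e.g., task accuracy under quantized inference).

```python
import numpy as np

# Minimal sketch (not the authors' implementation): parameterize a LoRA update
# Delta_W = B @ A by its SVD and let a simple evolution strategy search only
# the singular values. The reward below is a toy stand-in for a forward-only,
# black-box evaluation of the adapted model.

rng = np.random.default_rng(0)

d, r = 64, 8                        # hidden size and LoRA rank (illustrative)
A = rng.standard_normal((r, d)) * 0.05
B = rng.standard_normal((d, r)) * 0.05
delta_W = B @ A                     # low-rank adapter update, rank <= r

# SVD of the adapter: freeze U and Vt, optimize only the top-r singular values.
U, S, Vt = np.linalg.svd(delta_W, full_matrices=False)
U, S, Vt = U[:, :r], S[:r], Vt[:r, :]

def rebuild(sigma):
    """Reassemble the adapter from a candidate singular-value vector."""
    return (U * sigma) @ Vt

def reward(sigma):
    """Toy black-box objective; hypothetical 'good' spectrum as the target."""
    target = np.linspace(1.0, 0.1, r)
    return -np.sum((sigma - target) ** 2)

# Plain ES with antithetic (mirrored) Gaussian perturbations over r parameters.
sigma_vec = S.copy()
step, noise_std, pop = 0.05, 0.02, 16
for it in range(200):
    eps = rng.standard_normal((pop // 2, r))
    eps = np.concatenate([eps, -eps])            # mirrored sampling
    rewards = np.array([reward(sigma_vec + noise_std * e) for e in eps])
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    grad_est = (advantages[:, None] * eps).mean(axis=0) / noise_std
    sigma_vec += step * grad_est                 # ascend the estimated reward

adapted_delta_W = rebuild(sigma_vec)             # plug back into the frozen base model
print("final toy reward:", reward(sigma_vec))
```

Because only r singular values per adapter are searched, the candidate dimensionality is orders of magnitude smaller than the full LoRA parameter count, which is what makes population-based, inference-only search tractable at scale.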