Prompt-R1: Collaborative Automatic Prompting Framework via End-to-end Reinforcement Learning

Advanced large language models (LLMs) have recently emerged at an increasingly rapid pace. However, when faced with complex problems, most users cannot write accurate and effective prompts, which limits the performance of LLMs. To address this challenge, we propose Prompt-R1, an end-to-end reinforcement learning framework in which a small-scale LLM collaborates with large-scale LLMs, taking the user's place in the interaction to solve problems better. This collaboration is cast as a multi-turn prompt interaction: the small-scale LLM thinks and generates prompts, and the large-scale LLM performs complex reasoning. A dual-constrained reward is designed to optimize for correctness, generation quality, and reasoning accuracy. Prompt-R1 is a plug-and-play framework that supports both inference and training with various large-scale LLMs. Experiments on multiple public datasets show that Prompt-R1 significantly outperforms baseline models across tasks. Our code is publicly available at https://github.com/QwenQKing/Prompt-R1.
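
The multi-turn interaction described above can be illustrated with a minimal Python sketch. It is not the implementation from the Prompt-R1 repository: the names `rollout`, `toy_reward`, `Trajectory`, `scripted_agent`, and `stub_large_llm` are hypothetical, both models are replaced by plain functions, and the two-part reward (a format term that gates an answer term) is only an assumed stand-in for the dual-constrained reward, whose exact form the abstract does not give. The reinforcement learning update itself is omitted.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical interface: the small model returns ("prompt", text) to query
# the large model again, or ("answer", text) to stop with a final answer.
AgentStep = Tuple[str, str]
Agent = Callable[[str, List[Tuple[str, str]]], AgentStep]
LargeLLM = Callable[[str], str]


@dataclass
class Trajectory:
    question: str
    turns: List[Tuple[str, str]] = field(default_factory=list)  # (prompt, response)
    answer: str = ""
    well_formed: bool = False  # True only if the agent produced an explicit answer


def rollout(question: str, agent: Agent, large_llm: LargeLLM,
            max_turns: int = 4) -> Trajectory:
    """Run one multi-turn episode: the agent prompts, the large LLM responds."""
    traj = Trajectory(question)
    for _ in range(max_turns):
        kind, text = agent(question, traj.turns)
        if kind == "answer":
            traj.answer = text
            traj.well_formed = True
            break
        # Otherwise treat the text as a prompt and record the exchange.
        traj.turns.append((text, large_llm(text)))
    return traj


def toy_reward(traj: Trajectory, gold: str, format_weight: float = 0.2) -> float:
    """Assumed two-part reward: no credit at all unless the episode is well formed."""
    if not traj.well_formed:
        return 0.0
    correct = traj.answer.strip().lower() == gold.strip().lower()
    return format_weight + (1.0 - format_weight) * float(correct)


def scripted_agent(question: str, turns: List[Tuple[str, str]]) -> AgentStep:
    """Stand-in for the small-scale LLM: ask once, then answer with the last token."""
    if not turns:
        return ("prompt", f"Solve step by step: {question}")
    return ("answer", turns[-1][1].split()[-1])


def stub_large_llm(prompt: str) -> str:
    """Stand-in for the large-scale LLM: always returns the same reasoning string."""
    return "2 + 2 = 4"


if __name__ == "__main__":
    episode = rollout("What is 2 + 2?", scripted_agent, stub_large_llm)
    print(episode.turns, episode.answer, toy_reward(episode, "4"))
```

In a training setup, the scalar returned by the reward function would be the signal used to update the small-scale LLM's policy, while the large-scale LLM is treated as part of the environment and can be swapped for another model without changing the loop.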