Preference alignment has enabled large language models (LLMs) to better reflect human expectations, but current methods mostly optimize for population-level preferences, overlooking individual users. Personalization is essential, yet early approaches, such as prompt customization or fine-tuning, struggle to reason over implicit preferences, limiting their real-world effectiveness. Recent "think-then-generate" methods address this by reasoning before response generation. However, they face challenges in long-form generation: their static, one-shot reasoning must capture all the information relevant to the entire response up front, which makes learning difficult and limits adaptability to evolving content. To address this issue, we propose FlyThinker, an efficient "think-while-generating" framework for personalized long-form generation. FlyThinker employs a separate reasoning model that generates latent token-level reasoning in parallel, which is fused into the generation model to dynamically guide response generation. This design enables reasoning and generation to run concurrently, ensuring inference efficiency. In addition, the reasoning model is designed to depend only on previously generated response tokens rather than on its own prior outputs, which preserves training parallelism across positions: all reasoning tokens for the training data can be produced in a single forward pass, just as in standard LLM training, ensuring training efficiency. Extensive experiments on real-world benchmarks demonstrate that FlyThinker achieves better personalized generation while maintaining training and inference efficiency.
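The key structural property claimed above, that each position's reasoning latent depends only on the response prefix and never on the reasoning model's own earlier outputs, can be illustrated with a toy sketch. Everything below (`embed`, `reason`, `fused_step`, the mean-pooling and dimensions) is a hypothetical stand-in, not the paper's actual architecture; it only demonstrates why the dependency structure permits all positions to be computed independently during training.

```python
import math

D = 4  # latent size (illustrative only)

def embed(token):
    """Toy deterministic embedding for an integer token id."""
    return [math.sin((token + 1) * (i + 1)) for i in range(D)]

def reason(prefix):
    """Latent reasoning vector for one position.

    It is a function of the response prefix alone -- not of the
    reasoning model's own prior outputs -- so every position can be
    computed independently of the others.
    """
    embs = [embed(t) for t in prefix]
    mean = [sum(e[i] for e in embs) / len(embs) for i in range(D)]
    return [math.tanh(m) for m in mean]

def fused_step(prefix):
    """Fuse the generation model's hidden state (here: the last token's
    embedding) with the reasoning latent that guides this position."""
    h = embed(prefix[-1])
    r = reason(prefix)
    return [h[i] + r[i] for i in range(D)]

def all_latents(response):
    """All reasoning latents 'in one pass': there is no sequential
    dependency between positions, mirroring teacher-forced LM training."""
    return [reason(response[:t + 1]) for t in range(len(response))]

seq = [3, 1, 4, 1, 5]
latents = all_latents(seq)
print(len(latents), len(latents[0]))  # 5 4
```

Because `reason` never consumes its own earlier outputs, the batched `all_latents` computation agrees exactly with recomputing each position from its prefix in isolation, which is the parallelism argument made in the abstract.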