Large language models are increasingly deployed in multi-agent workflows. We introduce Prompt Choreography, a framework that efficiently executes LLM workflows by maintaining a dynamic, global KV cache. Each LLM call can attend to an arbitrary, reordered subset of previously encoded messages, and parallel calls are supported. Though caching messages' encodings sometimes yields different results than re-encoding them in a new context, we show in diverse settings that fine-tuning the LLM to work with the cache can help it mimic the original results. Prompt Choreography significantly reduces per-message latency (2.0--6.2$\times$ faster time-to-first-token) and achieves substantial end-to-end speedups ($>$2.2$\times$) in workflows dominated by redundant computation.
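The core idea of reusing cached message encodings across calls can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the class name, the `encode_fn` hook, and the string-based "encodings" are all hypothetical stand-ins for real KV-cache tensors.

```python
class GlobalKVCache:
    """Stores each message's encoding once; later LLM calls reuse it
    instead of re-encoding the message in a new context."""

    def __init__(self, encode_fn):
        self.encode_fn = encode_fn   # model-specific encoder (assumed interface)
        self.store = {}              # message id -> cached encoding
        self.encode_calls = 0        # counts actual (non-cached) encodings

    def encode(self, msg_id, text):
        # Encode each message at most once; subsequent calls are cache hits.
        if msg_id not in self.store:
            self.store[msg_id] = self.encode_fn(text)
            self.encode_calls += 1
        return self.store[msg_id]

    def context_for(self, msg_ids):
        # An LLM call attends to an arbitrary, reordered subset of
        # previously cached encodings -- no re-encoding needed.
        return [self.store[m] for m in msg_ids]


# Toy usage: a fake encoder stands in for the transformer's KV computation.
cache = GlobalKVCache(lambda text: f"enc({text})")
cache.encode("sys", "You are a helpful agent.")
cache.encode("user1", "Summarize the report.")
cache.encode("sys", "You are a helpful agent.")   # cache hit, no re-encode
ctx = cache.context_for(["user1", "sys"])          # reordered subset
```

The speedups reported above come from exactly this kind of reuse: in workflows where many calls share messages, most of the prefill computation is redundant and can be served from the cache.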