Large language model (LLM) inference is both expensive and slow. Local caching of responses offers a practical way to reduce the cost and latency of LLM queries. In research contexts, caching also enhances reproducibility and provides flexibility for experimentation. However, naive reuse of cached responses compromises statistical independence, a critical property for probabilistic workflows. In LLM applications for code, independence underpins performance metrics such as Pass@k and uncertainty estimation, as well as algorithms like program repair loops and retries. Existing LLM caching systems lack ways to enforce statistical independence constraints. To address this, we introduce Mnimi, a cache design pattern that supports modular LLM workflows while ensuring statistical integrity at the component level. Its core innovation lies in encapsulating statistical constraints within the type of LLM references, allowing users to manage and transform these types according to the scope and requirements of their algorithm. We implemented this design pattern in Python using a combination of decorators and iterators over infinite sequences. A case study on SpecFix, a recent automated program specification repair system, highlights how Mnimi improves reproducibility, ease of debugging, and time and cost efficiency while preserving statistical correctness.
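To make the "decorators and iterators over infinite sequences" idea concrete, the following is a minimal sketch of how such a cache might be structured; it is not the paper's actual Mnimi API, and the names (`cached_samples`, `query_llm`) are hypothetical. The decorator turns an LLM call into an infinite stream of responses per prompt: cached samples are replayed in order for reproducibility, fresh samples are drawn once the cache is exhausted, and within a consumer's scope consecutive draws are always distinct samples, preserving independence.

```python
import itertools
from functools import wraps

def cached_samples(llm_call):
    """Hypothetical decorator: wrap an LLM call so each prompt maps to an
    infinite iterator of responses. Cached responses are replayed in order,
    then fresh samples are drawn and appended to the cache, so repeated
    draws within one scope never silently reuse the same sample."""
    cache: dict[str, list[str]] = {}

    @wraps(llm_call)
    def stream(prompt: str):
        def generate():
            # Replay cached samples first (reproducibility across runs) ...
            for cached in cache.setdefault(prompt, []):
                yield cached
            # ... then fall back to fresh, independent samples.
            for _ in itertools.count():
                fresh = llm_call(prompt)
                cache[prompt].append(fresh)
                yield fresh
        return generate()
    return stream

@cached_samples
def query_llm(prompt: str) -> str:
    ...  # call the actual model here (assumption: any LLM client)

# samples = query_llm("Write a sorting function")
# first, second = next(samples), next(samples)  # two distinct draws
```

In the paper's design the statistical constraint is carried by the type of the LLM reference rather than by discipline at each call site; this sketch only illustrates the underlying caching-as-infinite-sequence mechanism.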