The serverless computing paradigm offers compelling advantages for deploying Large Language Model (LLM) inference services, including elastic scaling and pay-per-use billing. However, serving multiple fine-tuned LLMs via Low-Rank Adaptation (LoRA) in serverless environments faces critical challenges: reactive adapter loading incurs significant cold-start latency, and frequent adapter swapping leads to severe GPU memory fragmentation. In this paper, we present Predictive-LoRA (P-LoRA), a proactive, fragmentation-aware serverless inference system for LoRA-based LLMs. P-LoRA introduces two key innovations: (1) a lightweight LSTM-based traffic predictor that forecasts adapter demand and proactively prefetches hot adapters from host memory to GPU, reducing cold-start latency by up to 68%; and (2) a page-based adapter memory management mechanism inspired by operating system virtual memory, which keeps GPU memory utilization above 87% even under heterogeneous adapter ranks. We evaluate P-LoRA using production-like workloads derived from the Azure Functions trace. Experimental results demonstrate that, under high-concurrency scenarios, P-LoRA achieves 1.52x higher throughput than S-LoRA while reducing the average Time-To-First-Token (TTFT) by 35%.
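To make the prefetching idea concrete, the following is a minimal illustrative sketch (not the paper's implementation) of an LSTM that forecasts per-adapter request rates from a sliding window of recent counts, whose predictions could drive proactive host-to-GPU prefetching. All names, shapes, and hyperparameters (e.g. AdapterTrafficLSTM, the hidden size, top_k) are assumptions for illustration.

```python
# Illustrative sketch only: a lightweight LSTM traffic predictor for adapter demand.
# Names, shapes, and hyperparameters are assumptions, not taken from the paper.
import torch
import torch.nn as nn

class AdapterTrafficLSTM(nn.Module):
    def __init__(self, num_adapters: int, hidden: int = 64):
        super().__init__()
        # One input feature per adapter: its request count in each time slot.
        self.lstm = nn.LSTM(input_size=num_adapters, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_adapters)

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        # history: (batch, window, num_adapters) request counts over recent slots.
        out, _ = self.lstm(history)
        return self.head(out[:, -1])  # predicted per-adapter demand for the next slot

def prefetch_candidates(pred_counts: torch.Tensor, top_k: int = 4) -> torch.Tensor:
    """Select the adapters with the highest predicted demand as prefetch candidates."""
    return torch.topk(pred_counts, k=top_k, dim=-1).indices
```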
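Similarly, the sketch below illustrates one way a page-based adapter pool could work under the stated assumptions: LoRA adapters of heterogeneous ranks are flattened and stored in fixed-size GPU memory pages tracked by a page table, so evicting one adapter never leaves rank-dependent holes. The page size, class name PagedAdapterPool, and methods are hypothetical.

```python
# Illustrative sketch only: a page-based GPU pool for LoRA adapter weights,
# inspired by OS virtual memory. Details are assumptions, not the paper's code.
import torch

PAGE_ELEMS = 16384  # elements per page; a tunable assumption

class PagedAdapterPool:
    def __init__(self, num_pages: int, dtype=torch.float16, device="cuda"):
        # One contiguous slab of GPU memory, viewed as fixed-size pages.
        self.slab = torch.empty(num_pages, PAGE_ELEMS, dtype=dtype, device=device)
        self.free_pages = list(range(num_pages))
        self.page_table = {}  # adapter_id -> list of page indices

    def load(self, adapter_id: str, weights: torch.Tensor) -> None:
        """Copy a flattened LoRA adapter (any rank) into free pages."""
        flat = weights.flatten()
        n_pages = -(-flat.numel() // PAGE_ELEMS)  # ceiling division
        if n_pages > len(self.free_pages):
            raise MemoryError("not enough free pages; evict a cold adapter first")
        pages = [self.free_pages.pop() for _ in range(n_pages)]
        for i, p in enumerate(pages):
            chunk = flat[i * PAGE_ELEMS:(i + 1) * PAGE_ELEMS]
            self.slab[p, :chunk.numel()].copy_(chunk, non_blocking=True)
        self.page_table[adapter_id] = pages

    def evict(self, adapter_id: str) -> None:
        """Release an adapter's pages; no compaction is needed because pages are uniform."""
        self.free_pages.extend(self.page_table.pop(adapter_id))
```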