Large Language Models (LLMs) such as GPT and LLaMA exhibit distinctive memory access characteristics during inference, driven by frequent token sequence lookups and embedding vector retrievals. These workloads generate highly irregular, bursty access patterns that cause traditional prefetching and replacement policies to mispredict, triggering severe cache pollution and degrading system performance. To address this challenge, this paper proposes an Adaptive Cache Pollution Control (ACPC) mechanism tailored to LLM inference workloads, integrating Temporal Convolutional Network (TCN)-based access prediction with a priority-aware replacement strategy. The TCN module learns temporal dependencies in token access sequences to identify likely high-reuse cache lines, while the replacement policy dynamically adjusts eviction priorities based on predicted reuse likelihood and cache occupancy. The framework is implemented and evaluated on representative transformer-based inference traces, including GPT-style autoregressive decoding and embedding retrieval workloads. Experimental results show that, compared with state-of-the-art machine-learning-based replacement baselines, ACPC reduces cache pollution by 41.7 percent, improves cache hit rate by 8.9 percent, and cuts the L2 miss penalty by 60.0 percent. In addition, the TCN-based ACPC framework increases token generation throughput by 15.9 percent and achieves the lowest final loss of 0.21, confirming its efficiency and stability under complex LLM inference workloads. These results highlight ACPC's effectiveness at recognizing useful cache lines and suppressing redundant prefetches under dynamic LLM access behaviors. The proposed approach offers a scalable, learning-driven solution for improving memory efficiency and latency in large-scale LLM serving and inference systems.
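To make the replacement idea concrete, the following is a minimal, illustrative Python sketch of a priority-aware policy of the kind the abstract describes. All names here (`ACPCCache`, `predicted_reuse`, `occupancy_weight`) are hypothetical placeholders, and the caller-supplied `predicted_reuse` score merely stands in for the TCN's output; this shows the general pattern of reuse-prioritized eviction plus occupancy-gated admission, not the paper's actual implementation.

```python
# Minimal, illustrative sketch (not the paper's implementation) of a
# priority-aware replacement policy with occupancy-gated admission.
# The TCN predictor is abstracted away: callers pass in `predicted_reuse`,
# the model's estimated probability that a line will be reused soon.
from dataclasses import dataclass


@dataclass
class CacheLine:
    tag: int
    predicted_reuse: float  # hypothetical TCN output in [0, 1]


class ACPCCache:
    """Toy pool of cache lines with reuse-aware eviction and admission."""

    def __init__(self, capacity: int, occupancy_weight: float = 0.5):
        self.capacity = capacity
        self.occupancy_weight = occupancy_weight  # assumed tuning knob
        self.lines: dict[int, CacheLine] = {}

    def _should_admit(self, predicted_reuse: float) -> bool:
        # Pollution control: as occupancy rises, demand higher predicted
        # reuse before admitting a new line; otherwise bypass the cache.
        occupancy = len(self.lines) / self.capacity
        return predicted_reuse >= self.occupancy_weight * occupancy

    def access(self, tag: int, predicted_reuse: float) -> bool:
        """Return True on a hit; admit or bypass the line on a miss."""
        if tag in self.lines:
            self.lines[tag].predicted_reuse = predicted_reuse  # refresh
            return True
        if self._should_admit(predicted_reuse):
            if len(self.lines) >= self.capacity:
                # Evict the line least likely to be reused per the predictor.
                victim = min(self.lines.values(),
                             key=lambda l: l.predicted_reuse)
                del self.lines[victim.tag]
            self.lines[tag] = CacheLine(tag, predicted_reuse)
        return False


# Example: a low-reuse line is bypassed once the cache fills up.
cache = ACPCCache(capacity=2)
cache.access(0xA, predicted_reuse=0.9)
cache.access(0xB, predicted_reuse=0.8)
cache.access(0xC, predicted_reuse=0.1)  # bypassed; 0xA and 0xB stay resident
assert cache.access(0xA, predicted_reuse=0.9)  # still a hit
```

The key design point mirrored from the abstract is that eviction priority comes from predicted reuse while admission aggressiveness scales with occupancy, so predicted-low-reuse lines are kept from displacing useful residents when the cache is under pressure.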