The performance of modern software systems depends critically on their complex configuration options. Building accurate performance models to navigate this vast configuration space requires effective sampling strategies, yet existing methods often struggle with multi-objective optimization and cannot leverage semantic information from documentation. The recent success of Large Language Models (LLMs) motivates the central question of this work: can LLMs serve as effective samplers for multi-objective performance modeling? To explore this question, we present a comprehensive empirical study of the capabilities and characteristics of LLM-driven sampling. We design and implement LLM4Perf, a feedback-based framework, and use it to systematically evaluate the LLM-guided sampling process on four highly configurable, real-world systems. Our study reveals that the LLM-guided approach outperforms traditional baselines in most cases: LLM4Perf achieves the best performance in 68.8% (77 out of 112) of all evaluation scenarios, demonstrating its effectiveness. We find that this effectiveness stems from the LLM's dual capabilities of configuration space pruning and feedback-driven strategy refinement. The value of this pruning is further validated by the fact that it also improves the performance of the baseline methods in 91.5% (410 out of 448) of cases. Furthermore, we show how the choice of LLM for each component and the hyperparameter settings of LLM4Perf affect its effectiveness. Overall, this paper provides strong evidence for the effectiveness of LLMs in performance engineering and offers concrete insights into the mechanisms that drive their success.