The rapid advancement of AI technologies and their accelerated adoption in software development necessitate a systematic evaluation of their environmental impact alongside functional correctness. While prior studies have examined sustainability in large language models, existing approaches lack systematic frameworks for evaluating accuracy-energy trade-offs in Code Language Models (CLMs). In this paper, we present BRACE, a framework for benchmarking CLMs on a unified scale of energy efficiency and functional correctness (referred to as accuracy). We benchmark 22 state-of-the-art models on code generation and code summarization tasks and propose two rating methods: Concentric Incremental Rating Circles (CIRC) and Observation to Expectation Rating (OTER). CIRC provides deterministic, Euclidean-distance-based rankings with static trade-offs that are robust to outliers, while OTER offers trend-aware evaluation with dynamic trade-offs that capture the complex correlation between energy and accuracy; each addresses the problem from a distinct perspective. These methods rate CLMs on a 1-5 scale reflecting their combined energy efficiency and functional correctness. Our analysis reveals that models generally perform better on code summarization tasks, since they are not required to produce grammatically and syntactically correct output. We also find that model size does not significantly affect ratings, indicating that models which use their parameters efficiently can rank highly regardless of scale. The proposed BRACE framework enables practitioners to make evidence-based model selections that balance sustainability with task requirements, guiding the choice of rating method (CIRC for deterministic comparisons, OTER for trend-aware evaluation) according to deployment priorities.
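To make the CIRC idea concrete, the sketch below illustrates one plausible reading of a Euclidean-distance-based 1-5 rating; it is a minimal illustration under assumed conventions (accuracy and energy min-max normalized to [0, 1], lower energy better, equal-width concentric bands around an ideal point), not the paper's exact procedure.

```python
import math

def circ_rating(accuracy: float, energy: float, num_bands: int = 5) -> int:
    """Illustrative concentric-circle rating (assumed variant of CIRC).

    `accuracy` and `energy` are assumed normalized to [0, 1], with lower
    `energy` being better. Models are rated 1-5 by how close they sit to
    the ideal point (accuracy = 1, energy = 0).
    """
    # Euclidean distance from the ideal corner (1, 0); ranges over [0, sqrt(2)].
    distance = math.hypot(1.0 - accuracy, energy)
    # Split [0, sqrt(2)] into equal-width concentric circles; closer = higher rating.
    band_width = math.sqrt(2.0) / num_bands
    band_index = min(int(distance / band_width), num_bands - 1)
    return num_bands - band_index  # band 0 (closest to ideal) -> rating 5

# Hypothetical example: high accuracy, moderate normalized energy cost.
print(circ_rating(accuracy=0.85, energy=0.30))  # -> 4 under these assumptions
```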