Multimodal large language models (MLLMs) extend text-only LLMs by incorporating additional modalities, enabling multimodal understanding and a broader range of applications. However, these additions introduce an energy trade-off across modalities that remains poorly understood, as most prior work focuses on text-only models. In this paper, we examine modality inflation, a key source of inefficiency in which multimodal inputs increase inference workloads through extra encoding stages and expanded token sequences. We provide the first detailed, stage-level analysis of energy consumption in MLLM inference by breaking the pipeline into vision encoding, prefill, and decoding stages. Using four representative MLLMs evaluated on an NVIDIA A100 GPU, we quantify the additional energy required for multimodal inference relative to text-only baselines, observing overheads ranging from 17% to 94% across models for identical inputs. Our results show that energy bottlenecks differ widely across model architectures, stemming either from compute-heavy vision encoders or from the downstream impact of long visual token sequences during prefill. By examining GPU power traces, we further uncover substantial GPU underutilization during multimodal execution and show that input complexity leads to markedly different energy-scaling behaviors across models. Finally, we demonstrate that stage-wise dynamic voltage and frequency scaling (DVFS) is an effective optimization, yielding energy savings with only modest performance impact. Together, these findings offer practical insights and concrete guidance for designing more energy-efficient multimodal LLM serving systems.
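The stage-level energy accounting described above can be approximated by integrating GPU power samples over each stage's wall-clock window. The sketch below illustrates the idea only; `read_power_w` is a hypothetical stand-in for a real sampler (in practice one would wrap NVML's `nvmlDeviceGetPowerUsage`, which reports milliwatts), and the per-stage workloads are placeholders.

```python
import threading
import time

def read_power_w():
    # Hypothetical sampler for illustration; a real harness would call
    # pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0 on the target GPU.
    return 250.0  # assume a constant 250 W draw

def measure_stage_energy(stage_fn, interval_s=0.01):
    """Run one pipeline stage (e.g. vision encoding, prefill, or decoding)
    while sampling power in the background; return (result, joules)."""
    samples = []
    done = threading.Event()

    def sampler():
        while not done.is_set():
            samples.append(read_power_w())
            time.sleep(interval_s)

    t = threading.Thread(target=sampler)
    start = time.monotonic()
    t.start()
    result = stage_fn()          # the stage under measurement
    done.set()
    t.join()
    elapsed = time.monotonic() - start
    mean_power = sum(samples) / len(samples) if samples else 0.0
    return result, mean_power * elapsed  # E ≈ P_mean × t

if __name__ == "__main__":
    # Placeholder stage workloads; real stages would run model kernels.
    for stage in ("vision_encode", "prefill", "decode"):
        _, joules = measure_stage_energy(lambda: time.sleep(0.05))
        print(f"{stage}: {joules:.2f} J")
```

On GPUs that support it, NVML's cumulative `nvmlDeviceGetTotalEnergyConsumption` counter (millijoules since driver load) can replace the sampling loop; differencing it around each stage avoids sampling error entirely.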