As military organisations consider integrating large language models (LLMs) into command and control (C2) systems for planning and decision support, understanding their behavioural tendencies is critical. This study develops a benchmarking framework for evaluating aspects of legal and moral risk in targeting behaviour by comparing LLMs acting as agents in multi-turn simulated conflict. We introduce four metrics grounded in International Humanitarian Law (IHL) and military doctrine: Civilian Target Rate (CTR) and Dual-use Target Rate (DTR) assess compliance with legal targeting principles, while Mean and Max Simulated Non-combatant Casualty Value (SNCV) quantify tolerance for civilian harm. We evaluate three frontier models, GPT-4o, Gemini-2.5, and LLaMA-3.1, through 90 multi-agent, multi-turn crisis simulations across three geographic regions. Our findings reveal that off-the-shelf LLMs exhibit concerning and unpredictable targeting behaviour in simulated conflict environments. All models violated the IHL principle of distinction by targeting civilian objects, with breach rates ranging from 16.7% to 66.7%. Harm tolerance escalated through crisis simulations with MeanSNCV increasing from 16.5 in early turns to 27.7 in late turns. Significant inter-model variation emerged: LLaMA-3.1 selected an average of 3.47 civilian strikes per simulation with MeanSNCV of 28.4, while Gemini-2.5 selected 0.90 civilian strikes with MeanSNCV of 17.6. These differences indicate that model selection for deployment constitutes a choice about acceptable legal and moral risk profiles in military operations. This work seeks to provide a proof-of-concept of potential behavioural risks that could emerge from the use of LLMs in Decision Support Systems (AI DSS) as well as a reproducible benchmarking framework with interpretable metrics for standardising pre-deployment testing.
翻译:随着军事组织考虑将大语言模型(LLMs)集成至指挥与控制(C2)系统中以提供规划与决策支持,理解其行为倾向变得至关重要。本研究开发了一个基准测试框架,通过比较在多轮次模拟冲突中充当智能体的大语言模型,评估其在目标选择行为中涉及的法律与道德风险。我们引入了基于国际人道法(IHL)和军事学说的四个指标:民用目标率(CTR)和军民两用目标率(DTR)用于评估对合法目标选择原则的遵守情况,而平均与最大模拟非战斗人员伤亡值(SNCV)则量化了对平民伤害的容忍度。我们通过涵盖三个地理区域的90次多智能体、多轮次危机模拟,评估了三个前沿模型:GPT-4o、Gemini-2.5和LLaMA-3.1。我们的研究结果表明,现成的大语言模型在模拟冲突环境中表现出令人担忧且不可预测的目标选择行为。所有模型均因攻击民用目标而违反了国际人道法的区分原则,违规率在16.7%至66.7%之间。在危机模拟过程中,伤害容忍度逐步升级,平均SNCV从早期轮次的16.5上升至后期轮次的27.7。模型间存在显著差异:LLaMA-3.1在每次模拟中平均选择3.47次针对平民的打击,平均SNCV为28.4;而Gemini-2.5平均选择0.90次针对平民的打击,平均SNCV为17.6。这些差异表明,为部署选择何种模型,实质上是对军事行动中可接受的法律与道德风险水平的一种抉择。本研究旨在为决策支持系统(AI DSS)中使用大语言模型可能引发的潜在行为风险提供一个概念验证,同时提供一个具有可解释性指标的、可复现的基准测试框架,以标准化部署前的测试。