Causal reasoning in large language models, spanning association, intervention, and counterfactual inference, is essential for reliable decision making in high-stakes settings. As deployment shifts toward edge and resource-constrained environments, models quantized to formats such as INT8 and NF4 are becoming standard, yet the impact of precision reduction on formal causal reasoning remains poorly understood. To our knowledge, this is the first study to systematically evaluate quantization effects across all three levels of Pearl's Causal Ladder. Using a 3,000-sample stratified CLadder benchmark, we find that rung-level accuracy in Llama 3 8B remains broadly stable under quantization, with NF4 showing less than one percent overall degradation. Interventional queries at rung 2 are the most sensitive to precision loss, whereas counterfactual reasoning at rung 3 is comparatively stable but exhibits heterogeneous weaknesses across query types such as collider bias and backdoor adjustment. Experiments on the CRASS benchmark show near-identical performance across precisions, indicating that existing commonsense counterfactual datasets lack the structural sensitivity needed to reveal quantization-induced reasoning drift. We further evaluate Graph Retrieval-Augmented Generation with ground-truth causal graphs and observe a consistent +1.7% improvement in NF4 interventional accuracy, partially offsetting compression-related degradation. These results suggest that causal reasoning is unexpectedly robust to 4-bit quantization, that graph-structured augmentation can selectively reinforce interventional reasoning, and that current counterfactual benchmarks fail to capture deeper causal brittleness. This work provides an initial empirical map of compressed causal reasoning and practical guidance for deploying efficient, structurally supported causal AI systems.
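For context, NF4 quantization of a model such as Llama 3 8B is commonly configured through the Hugging Face transformers/bitsandbytes integration; the snippet below is a minimal sketch of such a setup, not the exact configuration used in this work. The model identifier and compute dtype are illustrative assumptions.

```python
# Minimal sketch: loading Llama 3 8B with 4-bit NF4 quantization via bitsandbytes.
# The checkpoint name and compute dtype below are assumptions, not the paper's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # enable 4-bit weight quantization
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # higher-precision dtype for matmuls
)

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```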