Recent advances in natural language processing have enabled the increasing use of text data in causal inference, particularly for adjusting for confounders in treatment effect estimation. Although high-dimensional text encodes rich contextual information, it also poses unique challenges for causal identification and estimation. In particular, the positivity assumption, which requires sufficient treatment overlap across confounder values, is often violated at the observational level when massive text is represented in high-dimensional feature spaces. Redundant or spurious textual features inflate dimensionality, producing extreme propensity scores, unstable weights, and inflated variance in effect estimates. We address these challenges with Confounding-Aware Token Rationalization (CATR), a framework that selects a sparse, necessary subset of tokens via a residual-independence diagnostic designed to preserve the confounding information required for unconfoundedness. By discarding irrelevant tokens while retaining key signals, CATR mitigates observational-level positivity violations and stabilizes downstream causal effect estimators. Experiments on synthetic data and a real-world study using the MIMIC-III database demonstrate that CATR yields more accurate, stable, and interpretable causal effect estimates than existing baselines.
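The link between redundant features and positivity violations can be made concrete with a minimal numpy sketch. This is not the CATR method itself, only an illustration of the failure mode the abstract describes: the sample size, feature dimension, and coefficient values below are illustrative assumptions, with five "true" confounding features and the rest spurious loadings in the propensity model.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 200  # illustrative sample size and text-feature dimension
X = rng.normal(size=(n, d))

# Hypothetical propensity models: a sparse one using only the 5 true
# confounders, and a "full" one that also loads on 195 spurious features.
beta_sparse = np.r_[np.ones(5), np.zeros(d - 5)]
beta_full = np.r_[np.ones(5), 0.3 * np.ones(d - 5)]

def extreme_fraction(beta, lo=0.05, hi=0.95):
    """Fraction of propensity scores outside [lo, hi] under this model."""
    e = 1.0 / (1.0 + np.exp(-(X @ beta)))  # propensity scores
    return np.mean((e < lo) | (e > hi))

frac_sparse = extreme_fraction(beta_sparse)
frac_full = extreme_fraction(beta_full)
# The redundant loadings inflate the logit variance, pushing many scores
# toward 0 or 1; inverse-propensity weights 1/e and 1/(1-e) then explode,
# which is the observational-level positivity violation at issue.
```

Discarding the spurious features while keeping the true confounders, as a token-selection scheme like CATR aims to do, shrinks the extreme-score fraction and hence stabilizes the downstream weighted estimators.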