Simpson's paradox, a long-standing statistical phenomenon, describes the reversal of an observed association when data are disaggregated into sub-populations. It has critical implications across statistics, epidemiology, economics, and causal inference. Existing methods for detecting Simpson's paradox overlook a key issue: many paradoxes are redundant, arising from equivalent selections of data subsets, identical partitioning of sub-populations, and correlated outcome variables. This redundancy obscures essential patterns and inflates computational cost. In this paper, we present the first framework for discovering non-redundant Simpson's paradoxes. We formalize three types of redundancy (sibling child, separator, and statistic equivalence) and show that redundancy forms an equivalence relation. Leveraging this insight, we propose a concise representation framework for systematically organizing redundant paradoxes, and we design efficient algorithms that integrate depth-first materialization of the base table with redundancy-aware paradox discovery. Experiments on real-world datasets and synthetic benchmarks show that redundant paradoxes are widespread, constituting over 40% of all paradoxes on some real datasets. Our algorithms scale to millions of records, reduce runtime by up to 60%, and discover paradoxes that are structurally robust under data perturbation. These results demonstrate that Simpson's paradoxes can be efficiently identified, concisely summarized, and meaningfully interpreted in large multidimensional datasets.
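To make the reversal described above concrete, the following minimal Python sketch checks a single two-treatment comparison for Simpson's paradox. It is not the redundancy-aware discovery algorithm of this paper. The function name `is_simpsons_paradox`, the `COUNTS` layout, and the counts themselves are illustrative assumptions, with the numbers modeled on the widely cited kidney-stone treatment example.

```python
from typing import Dict, Tuple

# Hypothetical layout: (sub-population, treatment) -> (successes, trials).
Counts = Dict[Tuple[str, str], Tuple[int, int]]

# Illustrative counts, modeled on the classic kidney-stone example.
COUNTS: Counts = {
    ("small", "A"): (81, 87),
    ("small", "B"): (234, 270),
    ("large", "A"): (192, 263),
    ("large", "B"): (55, 80),
}


def sign(x: float) -> int:
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def is_simpsons_paradox(counts: Counts, t1: str, t2: str) -> bool:
    """True if t1 vs. t2 has the same strict ordering of success rates in
    every sub-population but the opposite ordering in the aggregate."""
    groups = sorted({g for g, _ in counts})

    # Sign of the rate difference (t1 minus t2) inside each sub-population.
    sub_signs = set()
    for g in groups:
        s1, n1 = counts[(g, t1)]
        s2, n2 = counts[(g, t2)]
        sub_signs.add(sign(s1 / n1 - s2 / n2))

    # Sign of the rate difference after pooling all sub-populations.
    s1 = sum(counts[(g, t1)][0] for g in groups)
    n1 = sum(counts[(g, t1)][1] for g in groups)
    s2 = sum(counts[(g, t2)][0] for g in groups)
    n2 = sum(counts[(g, t2)][1] for g in groups)
    agg_sign = sign(s1 / n1 - s2 / n2)

    # Paradox: all sub-populations agree strictly, and the aggregate reverses.
    if len(sub_signs) != 1 or 0 in sub_signs:
        return False
    return agg_sign == -next(iter(sub_signs))


print(is_simpsons_paradox(COUNTS, "A", "B"))  # True
```

Here treatment A has the higher success rate in both the small and the large sub-population, yet treatment B has the higher rate once the sub-populations are pooled. A discovery framework of the kind described in the abstract would have to run such a check over many combinations of data subsets, partitioning attributes, and outcome statistics, which is where redundant paradoxes can arise.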