Gene regulatory network inference (GRNI) aims to discover how genes causally regulate each other from gene expression data. It is well known that statistical dependencies in observed data do not necessarily imply causation, as spurious dependencies may arise from latent confounders, such as non-coding RNAs. Numerous GRNI methods have thus been proposed to address this confounding issue. However, dependencies may also result from selection, where only cells satisfying certain survival or inclusion criteria are observed, and these selection-induced spurious dependencies are frequently overlooked in gene expression data analyses. In this work, we show that such selection is ubiquitous and, when ignored or conflated with true regulation, can lead to flawed causal interpretations and misguided intervention recommendations. To address this challenge, a fundamental question arises: can we distinguish dependencies due to regulation, confounding, and, crucially, selection? We show that gene perturbations offer a simple yet effective answer: selection-induced dependencies are symmetric under perturbation, while those from regulation or confounding are not. Building on this observation, we propose GISL (Gene regulatory network Inference in the presence of Selection bias and Latent confounders), a principled algorithm that leverages perturbation data to uncover both true gene regulatory relations and the non-regulatory mechanisms of selection and confounding, up to an equivalence class. Experiments on synthetic and real-world gene expression data demonstrate the effectiveness of our method.
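The claim that selection-induced dependencies are symmetric under perturbation, while those from regulation or confounding are not, can be illustrated with a toy simulation. The sketch below is not the GISL algorithm; it only assumes simple linear-Gaussian mechanisms for two genes. The function names `simulate` and `responds`, the coefficient 0.8, the shift size, and the selection rule `x + y > 1` are all hypothetical choices made for illustration. A perturbation is modeled here as a constant shift added to the targeted gene.

```python
import numpy as np


def simulate(mechanism, perturb=None, n=200_000, shift=2.0, seed=0):
    """Return (x, y) expression of the observed cells under a toy mechanism.

    All mechanisms and constants are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    ex = rng.normal(size=n)  # noise of gene x
    ey = rng.normal(size=n)  # noise of gene y
    u = rng.normal(size=n)   # latent confounder (used only for "confounding")
    dx = shift if perturb == "x" else 0.0
    dy = shift if perturb == "y" else 0.0

    if mechanism == "regulation":        # x -> y
        x = ex + dx
        y = 0.8 * x + ey + dy
    elif mechanism == "confounding":     # x <- u -> y
        x = 0.8 * u + ex + dx
        y = 0.8 * u + ey + dy
    elif mechanism == "selection":       # x -> s <- y, only s = 1 is observed
        x = ex + dx
        y = ey + dy
        keep = (x + y) > 1.0             # assumed inclusion criterion
        x, y = x[keep], y[keep]
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    return x, y


def responds(mechanism, target, tol=0.1):
    """True if the other gene's observed mean moves when `target` is perturbed."""
    x0, y0 = simulate(mechanism)
    x1, y1 = simulate(mechanism, perturb=target)
    before, after = (y0, y1) if target == "x" else (x0, x1)
    return bool(abs(after.mean() - before.mean()) > tol)


for m in ("regulation", "confounding", "selection"):
    x, y = simulate(m)
    corr = np.corrcoef(x, y)[0, 1]
    print(f"{m:12s} corr={corr:+.2f} "
          f"perturb x moves y: {responds(m, 'x')}  "
          f"perturb y moves x: {responds(m, 'y')}")
```

In this toy setting all three mechanisms produce a clear correlation in the observed data, but they respond differently to perturbation: under regulation only the downstream gene moves, under confounding neither gene moves, and under selection each gene moves when the other is perturbed.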