The explosion of biobank data offers immediate opportunities for gene-environment (GxE) interaction studies of complex diseases because of the large sample sizes and the rich collection of genetic and non-genetic information. However, the extremely large sample size also introduces new computational challenges in GxE assessment, especially for set-based GxE variance component (VC) tests, a widely used strategy to boost overall GxE signals and to evaluate the joint GxE effect of multiple variants from a biologically meaningful unit (e.g., a gene). In this work, focusing on continuous traits, we present SEAGLE, a Scalable Exact AlGorithm for Large-scale set-based GxE tests, to permit GxE VC tests for biobank-scale data. SEAGLE employs modern matrix computations to achieve the same "exact" results as the original GxE VC tests; it neither imposes additional assumptions nor relies on approximations. SEAGLE can easily accommodate sample sizes on the order of $10^5$, is implementable on standard laptops, and does not require specialized computing equipment. We demonstrate the performance of SEAGLE using extensive simulations. We illustrate its utility by conducting a genome-wide gene-based GxE analysis on the Taiwan Biobank data to explore the interaction between genes and physical activity status on body mass index.