Distance-Preserving Representations for Genomic Spatial Reconstruction

The spatial context of single-cell gene expression data is crucial for many downstream analyses, yet it often remains inaccessible because of practical and technical limitations, which restricts the utility of such datasets. In this paper, we propose dp-VAE, a generic representation learning and transfer learning framework that reconstructs the spatial coordinates associated with given gene expression data. Central to our approach is a distance-preserving regularizer added to the training loss, which ensures that the model captures and uses spatial context signals from reference datasets. At inference time, the latent representation produced by the model can be used to reconstruct or impute the spatial context of the given gene expression data by solving a constrained optimization problem. We also explore the theoretical connections between the distance-preserving loss, distortion, and the bi-Lipschitz condition in generative models. Finally, we demonstrate the effectiveness of dp-VAE on tasks involving training robustness, out-of-sample evaluation, and transfer learning inference by testing it on 27 publicly available datasets. These results underscore its applicability to a wide range of genomics studies that were previously hindered by the absence of spatial data.
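
To make the link between a distance-preserving loss, distortion, and the bi-Lipschitz condition concrete, the following is a minimal sketch rather than the formulation used by dp-VAE. It assumes the regularizer is the mean squared gap between pairwise distances in latent space and pairwise distances between the reference spatial coordinates, and it measures distortion as the ratio of the largest to the smallest pairwise expansion factor, which are the empirical upper and lower bi-Lipschitz constants on the sample. The function names, the `scale` parameter, and the choice of Euclidean distance are all hypothetical.

```python
import math
from itertools import combinations


def pairwise_distances(points):
    """Euclidean distance for every unordered pair (i < j) of points."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]


def distance_preserving_penalty(latent, coords, scale=1.0):
    """Mean squared gap between latent and scaled spatial pairwise distances.

    Assumed form of the regularizer (not taken from the paper): it is zero
    exactly when every latent distance equals `scale` times the matching
    spatial distance.
    """
    d_latent = pairwise_distances(latent)
    d_space = pairwise_distances(coords)
    gaps = [(dz - scale * dx) ** 2 for dz, dx in zip(d_latent, d_space)]
    return sum(gaps) / len(gaps)


def distortion(latent, coords):
    """Largest expansion factor divided by the smallest, over all pairs.

    With c = min ratio and C = max ratio, every sampled pair satisfies
    c * d_space <= d_latent <= C * d_space, the bi-Lipschitz condition
    restricted to the sample. A distortion of 1.0 means the latent map is
    a scaled isometry on these points.
    """
    ratios = [
        dz / dx
        for dz, dx in zip(pairwise_distances(latent), pairwise_distances(coords))
        if dx > 0  # skip coincident spatial points, where the ratio is undefined
    ]
    if min(ratios) == 0:
        return math.inf  # two distinct spatial points collapsed in latent space
    return max(ratios) / min(ratios)
```

Under these assumptions, a latent embedding that is a rotated and translated copy of the coordinates gives a penalty of zero and a distortion of one, while stretching one axis raises both. In a training loop such a penalty would be added to the usual VAE objective with a weighting coefficient, which is again an assumption about how the regularizer enters the loss.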
