Near-Optimal Privacy-Utility Tradeoff in Genomic Studies Using Selective SNP Hiding

Motivation: Researchers need a rich trove of genomic datasets to better understand the human genome and to identify associations between phenotypes and specific parts of DNA. However, sharing genomic datasets that include individuals' sensitive genetic or medical information can have serious privacy consequences if the data falls into the wrong hands. Restricting access to genomic datasets is one solution, but it greatly reduces their usefulness for research. To allow genomic datasets to be shared while addressing these privacy concerns, several studies have proposed privacy-preserving data-sharing mechanisms. Differential privacy (DP) is one such mechanism: it provides a rigorous mathematical foundation for privacy guarantees when sharing aggregate statistics about a dataset. However, it has been shown that the original privacy guarantees of DP-based solutions degrade when the dataset contains dependent tuples, which is common in genomic datasets because they often include family members.

Results: In this work, we introduce a near-optimal mechanism that mitigates the vulnerability to inference attacks of differentially private query results from genomic datasets that include dependent tuples. We propose a utility-maximizing, privacy-preserving approach to sharing statistics that selectively hides SNPs of family members as they participate in a genomic dataset. Evaluating the mechanism on a real-world genomic dataset, we empirically demonstrate that it achieves up to 40% better privacy than state-of-the-art DP-based solutions while near-optimally minimizing utility loss.


