The big-data revolution announced ten years ago does not seem to have fully happened at the expected scale. One of the main obstacles has been the lack of data circulation. And one of the many reasons people and organizations did not share as much data as expected is the privacy risk associated with data sharing operations. There has been much work on practical systems for computing statistical queries with Differential Privacy (DP). There have also been practical implementations of systems to train neural networks with DP, but relatively little effort has been dedicated to designing scalable classical Machine Learning (ML) models that provide DP guarantees. In this work we describe and implement a DP fork of a battle-tested ML model: XGBoost. Our approach beats previous attempts at the task by a large margin in terms of accuracy achieved for a given privacy budget. It is also the only DP implementation of boosted trees that scales to big data and can run in distributed environments such as Kubernetes, Dask, or Apache Spark.
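To make the notion of a DP statistical query concrete (this is not the paper's boosted-tree mechanism, just the standard Laplace mechanism applied to a counting query, with all names below chosen for illustration):

```python
import numpy as np

def laplace_count(data, predicate, epsilon, rng=None):
    """Release a differentially private count of records matching `predicate`.

    A counting query has sensitivity 1 (adding or removing one record
    changes the count by at most 1), so adding Laplace noise with
    scale 1/epsilon yields an epsilon-DP answer.
    """
    rng = np.random.default_rng() if rng is None else rng
    true_count = sum(1 for x in data if predicate(x))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Example: privately count entries with age >= 30 under budget epsilon = 1.
ages = [23, 35, 41, 29, 52, 38]
noisy = laplace_count(ages, lambda a: a >= 30, epsilon=1.0)
```

Smaller values of epsilon mean larger noise and stronger privacy; answering several queries consumes a share of the total privacy budget each time.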