Distributed Logistic Regression for Massive Data with Rare Events

Large-scale rare events data are commonly encountered in practice. To handle massive rare events data, we propose a novel distributed estimation method for logistic regression. A distributed framework raises two challenges. The first is how to distribute the data across the system; here, two distribution strategies (the RANDOM strategy and the COPY strategy) are investigated. The second is how to select the type of objective function so that the best asymptotic efficiency is achieved; here, the under-sampled (US) and inverse probability weighted (IPW) types of objective functions are considered. Our results suggest that the COPY strategy together with the IPW objective function is the best solution for distributed logistic regression with rare events. The finite sample performance of the distributed methods is demonstrated by simulation studies and a real-world Sweden Traffic Sign dataset.
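The abstract names the two distribution strategies and the IPW objective without defining them, so the following is only a minimal sketch under stated assumptions, not the authors' method. It assumes that RANDOM sends each observation to exactly one worker at random, that COPY places every positive (rare) case on all workers while splitting the negatives among them, and that IPW reweights each negative by the inverse of its inclusion probability on a worker. The function names, the Newton solver, and the simple averaging of worker estimates are all hypothetical choices made for illustration.

```python
import numpy as np

def simulate(n, intercept, beta, rng):
    """Draw features and rare binary labels from a logistic model."""
    X = rng.normal(size=(n, len(beta)))
    p = 1.0 / (1.0 + np.exp(-(intercept + X @ beta)))
    y = (rng.random(n) < p).astype(float)
    return X, y

def split_random(y, k, rng):
    """Assumed RANDOM strategy: each row lands on exactly one of k workers."""
    return np.array_split(rng.permutation(len(y)), k)

def split_copy(y, k, rng):
    """Assumed COPY strategy: all positives on every worker, negatives partitioned."""
    pos = np.flatnonzero(y == 1)
    neg = rng.permutation(np.flatnonzero(y == 0))
    return [np.concatenate([pos, part]) for part in np.array_split(neg, k)]

def fit_weighted_logit(X, y, w, n_iter=100, tol=1e-10):
    """Newton's method for a weighted logistic log-likelihood with an intercept."""
    Z = np.column_stack([np.ones(len(y)), X])
    theta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Z @ theta)))
        grad = Z.T @ (w * (y - p))
        hess = (Z * (w * p * (1.0 - p))[:, None]).T @ Z
        step = np.linalg.solve(hess, grad)
        theta = theta + step
        if np.max(np.abs(step)) < tol:
            break
    return theta

def distributed_estimate(X, y, parts, neg_weight):
    """Fit on each worker, then average the worker estimates."""
    fits = []
    for idx in parts:
        # Positives get weight 1; negatives get the (assumed) inverse inclusion probability.
        w = np.where(y[idx] == 1, 1.0, neg_weight)
        fits.append(fit_weighted_logit(X[idx], y[idx], w))
    return np.mean(fits, axis=0)

def run_demo(seed=0, n=100_000, k=5):
    """Compare RANDOM (unweighted) with COPY + IPW on simulated rare-event data."""
    rng = np.random.default_rng(seed)
    truth = np.array([-5.0, 1.0, -1.0])  # intercept and two slopes (illustrative values)
    X, y = simulate(n, truth[0], truth[1:], rng)
    random_est = distributed_estimate(X, y, split_random(y, k, rng), neg_weight=1.0)
    # Under COPY each negative sits on one of k workers, so its weight is k.
    copy_ipw_est = distributed_estimate(X, y, split_copy(y, k, rng), neg_weight=float(k))
    return truth, random_est, copy_ipw_est

if __name__ == "__main__":
    truth, random_est, copy_ipw_est = run_demo()
    print("truth      ", truth)
    print("RANDOM     ", random_est)
    print("COPY + IPW ", copy_ipw_est)
```

In this sketch the weight `k` on negatives makes each worker's weighted objective approximate the full-data log-likelihood, which is why every worker can use all positives without biasing the intercept. A US-type objective would instead fit without these weights; how the paper handles that case is not stated in the abstract, so it is left out here.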


Related content

Logistic regression


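Logistic regression (also called logit regression), or the logit model, is one of the discrete choice models. It belongs to multivariate analysis and is a commonly used method for empirical statistical analysis in sociology, biostatistics, clinical research, quantitative psychology, econometrics, and marketing. In statistics, the logistic (or logit) model is used to model the probability of a certain class or event, such as pass/fail, win/lose, alive/dead, or healthy/sick. It can be extended to model several classes of events, such as determining whether an image contains a cat, a dog, a lion, and so on. Each object detected in the image is assigned a probability between 0 and 1, and these probabilities sum to 1.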
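The description above can be written out in the standard textbook form. The symbols below (feature vector x, intercept \beta_0, coefficient vectors \beta and \beta_k, and K classes) are notation assumed here for illustration and do not appear in the original text.

```latex
% Binary case: the probability of the event is a logistic function of a linear score,
% so the log-odds (logit) is linear in the features.
P(Y = 1 \mid x) = \frac{1}{1 + \exp\{-(\beta_0 + x^\top \beta)\}},
\qquad
\log \frac{P(Y = 1 \mid x)}{1 - P(Y = 1 \mid x)} = \beta_0 + x^\top \beta .

% Multiclass extension: each class probability lies in (0, 1) and they sum to 1.
P(Y = k \mid x) = \frac{\exp(x^\top \beta_k)}{\sum_{j=1}^{K} \exp(x^\top \beta_j)},
\qquad k = 1, \dots, K .
```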
