Biases in human mobility data impact epidemic modeling

Large-scale human mobility data is a key resource in data-driven policy making and across many scientific fields. Most recently, mobility data was used extensively during the COVID-19 pandemic to study the effects of governmental policies and to inform epidemic models. Large-scale mobility is often measured using digital tools such as mobile phones. However, it remains an open question how truthfully these digital proxies represent the actual travel behavior of the general population. Here, we examine mobility datasets from multiple countries and identify two fundamentally different types of bias caused by unequal access to, and unequal usage of, mobile phones. We introduce the concept of data generation bias, a previously overlooked type of bias, which is present when the amount of data that an individual produces influences their representation in the dataset. We find evidence for data generation bias in all examined datasets: high-wealth individuals are overrepresented, with the richest 20% contributing over 50% of all recorded trips, substantially skewing the datasets. This inequality is consequential, as we find the mobility patterns of different wealth groups to be structurally different: the mobility networks of high-wealth users are denser and contain more long-range connections. To mitigate the skew, we present a framework to debias data and show how simple techniques can be used to increase representativeness. Using our approach, we show how biases can severely impact the outcomes of dynamic processes such as epidemic simulations, where biased data leads to incorrect estimates of the severity and speed of disease transmission. Overall, we show that a failure to account for biases can have detrimental effects on the results of studies, and we urge researchers and practitioners to account for data fairness in all future studies of human mobility.
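
To make the idea of data generation bias concrete, the following is a minimal sketch of one simple reweighting technique (post-stratification by wealth group). It is an illustration under stated assumptions, not the framework presented in the paper: the function name debias_flows, the trip tuple layout, and the toy numbers are all hypothetical. Each trip is weighted so that every wealth group's share of the total trip weight matches its share of the population.

```python
from collections import defaultdict

def debias_flows(trips, population_share):
    """Reweight trips so each wealth group contributes in proportion
    to its population share (hypothetical helper, for illustration only).

    trips: iterable of (origin, destination, wealth_group) tuples
    population_share: dict mapping wealth_group -> fraction of the population
    Returns (flows, weights): weighted origin-destination flows and
    the per-group weight that was applied.
    """
    trips = list(trips)
    total = len(trips)

    # Observed share of trips generated by each wealth group.
    counts = defaultdict(int)
    for _, _, group in trips:
        counts[group] += 1

    # Weight = population share / observed trip share.
    # Overrepresented groups get a weight below 1, underrepresented above 1.
    weights = {g: population_share[g] / (counts[g] / total) for g in counts}

    # Aggregate weighted trips into an origin-destination flow table.
    flows = defaultdict(float)
    for origin, destination, group in trips:
        flows[(origin, destination)] += weights[group]
    return dict(flows), weights

# Toy data (assumed): the richest 20% of people generate half of all trips.
example_trips = [("A", "B", "high")] * 5 + [("A", "C", "low")] * 5
example_shares = {"high": 0.2, "low": 0.8}
example_flows, example_weights = debias_flows(example_trips, example_shares)
```

In this toy example the raw data suggests that the routes A to B and A to C are equally busy, whereas the reweighted flows shift most of the weight to the route used by the underrepresented group. The sketch assumes every group in population_share appears at least once in the trips; groups that are entirely missing from the data cannot be recovered by reweighting alone.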