In a variety of settings, limitations of sensing technologies or other sampling mechanisms result in missing labels, where the likelihood of a missing label in the training set is an unknown function of the data. For example, satellites used to detect forest fires cannot sense fires below a certain size threshold. In such cases, training datasets consist of positive and pseudo-negative observations, where a pseudo-negative observation can be either a true negative or an undetected positive of small magnitude. We develop a new methodology and non-convex algorithm, P(ositive) U(nlabeled) - O(ccurrence) M(agnitude) M(ixture), which jointly estimates the occurrence and detection likelihood of positive samples, utilizing prior knowledge of the detection mechanism. Our approach draws on ideas from positive-unlabeled (PU) learning and from zero-inflated models that jointly estimate the magnitude and occurrence of events. We provide conditions under which our model is identifiable and prove that, even though our approach leads to a non-convex objective, any local minimizer has optimal statistical error (up to a log term) and projected gradient descent converges at a geometric rate. We demonstrate on both synthetic data and a California wildfire dataset that our method outperforms existing state-of-the-art approaches.
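To make the labeling mechanism concrete, the following is a minimal simulation sketch of how pseudo-negative observations can arise when events below a size threshold go undetected. It is not the paper's model or algorithm: the function name `simulate`, the logistic occurrence probability, the log-normal magnitude, and the hard detection threshold are all illustrative assumptions.

```python
import math
import random


def simulate(n=1000, threshold=1.0, seed=0):
    """Simulate positive / pseudo-negative labels under a size threshold.

    Every modeling choice here (logistic occurrence, log-normal magnitude,
    hard threshold) is an illustrative assumption, not the paper's model.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)                 # a single covariate
        p_occ = 1.0 / (1.0 + math.exp(-x))      # assumed occurrence probability
        occurred = rng.random() < p_occ         # latent true label
        # Magnitude is only defined for events that occurred.
        magnitude = math.exp(0.5 * x + rng.gauss(0.0, 1.0)) if occurred else 0.0
        # The sensor reports an event only if it is large enough.
        detected = occurred and magnitude >= threshold
        rows.append((x, occurred, magnitude, detected))
    return rows


rows = simulate()
n_true_pos = sum(r[1] for r in rows)      # events that actually occurred
n_observed_pos = sum(r[3] for r in rows)  # events the sensor labeled positive
n_hidden_pos = n_true_pos - n_observed_pos  # positives hidden among pseudo-negatives
print(n_true_pos, n_observed_pos, n_hidden_pos)
```

In this sketch, every observation with `detected == False` is a pseudo-negative: it is either a true negative or a positive whose magnitude fell below `threshold`. A classifier trained on `detected` alone would treat both cases identically, which is the bias that joint modeling of occurrence and magnitude is meant to address.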