In big data streams, the variable set of a model often changes with the condition of the stream. In this paper, we propose a homogenization strategy to represent heterogeneous models that are gradually updated as the data stream arrives. With the homogenized representations, online updating statistics such as parameter estimates, the residual sum of squares, and the $F$-statistic can be constructed easily for heterogeneous updating regression models. The main difference from the classical setting is that the artificial covariates in the homogenized models do not have the same distribution as the natural covariates in the original models; consequently, the related theoretical properties differ from the classical ones. We establish the asymptotic properties of the online updating statistics, which show that the new method achieves estimation efficiency and the oracle property without any constraint on the number of data batches. The behavior of the method is further illustrated by numerical examples from simulation experiments.
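For orientation, the classical setting that the abstract contrasts against can be sketched in a few lines. The sketch below assumes that every batch shares one fixed set of covariates, so it does not implement the proposed homogenization strategy, whose construction of artificial covariates is not given in the abstract. The class name `OnlineOLS`, its methods, and the simulated data are hypothetical and chosen only for illustration.

```python
import numpy as np


class OnlineOLS:
    """Least squares over data batches via running sufficient statistics.

    Assumption (not from the paper): all batches share the same p covariates.
    """

    def __init__(self, p):
        self.xtx = np.zeros((p, p))  # running sum of X'X
        self.xty = np.zeros(p)       # running sum of X'y
        self.yty = 0.0               # running sum of y'y
        self.n = 0                   # observations seen so far

    def update(self, X, y):
        """Fold one batch into the running sums; the raw batch can then be dropped."""
        self.xtx += X.T @ X
        self.xty += X.T @ y
        self.yty += float(y @ y)
        self.n += X.shape[0]

    def coef(self):
        """Least-squares estimate based on all batches seen so far."""
        return np.linalg.solve(self.xtx, self.xty)

    def rss(self, idx=None):
        """Residual sum of squares of the model restricted to covariates `idx`."""
        idx = np.arange(len(self.xty)) if idx is None else np.asarray(idx)
        A = self.xtx[np.ix_(idx, idx)]
        b = self.xty[idx]
        return self.yty - float(b @ np.linalg.solve(A, b))

    def f_stat(self, keep):
        """F-statistic for H0: all coefficients outside `keep` are zero."""
        p = len(self.xty)
        q = p - len(keep)
        rss_full = self.rss()
        rss_reduced = self.rss(keep)
        return ((rss_reduced - rss_full) / q) / (rss_full / (self.n - p))


# Simulated stream: 5 batches of 200 observations, third coefficient is zero.
rng = np.random.default_rng(0)
beta = np.array([1.0, -2.0, 0.0])
model = OnlineOLS(p=3)
batches = []
for _ in range(5):
    X = rng.normal(size=(200, 3))
    y = X @ beta + rng.normal(size=200)
    batches.append((X, y))
    model.update(X, y)

print(model.coef(), model.rss(), model.f_stat(keep=[0, 1]))
```

Because only the sums `xtx`, `xty`, and `yty` are stored, the memory cost does not grow with the number of batches, and the estimate equals the one obtained by fitting all batches at once. Handling batches whose covariate sets differ is the problem the paper addresses and lies outside this sketch.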