Anomaly detection on data streams is challenging: a method must maintain high detection accuracy under evolving distributions while remaining efficient enough for real-time use. Here we introduce $\mathcal{IDK}$-$\mathcal{S}$, a novel $\mathbf{I}$ncremental $\mathbf{D}$istributional $\mathbf{K}$ernel for $\mathbf{S}$treaming anomaly detection that addresses these challenges by creating a new dynamic representation in the kernel mean embedding framework. The superiority of $\mathcal{IDK}$-$\mathcal{S}$ is attributed to two key innovations. First, it inherits the strengths of the Isolation Distributional Kernel, an offline detector that has demonstrated significant performance advantages over foundational methods such as Isolation Forest and Local Outlier Factor owing to its use of a data-dependent kernel. Second, it adopts a lightweight incremental update mechanism that significantly reduces computational overhead compared with the naive baseline of fully retraining the model. This is achieved without compromising detection accuracy, a claim supported by the statistical equivalence of the incrementally updated model to the fully retrained one. Our extensive experiments on thirteen benchmarks demonstrate that $\mathcal{IDK}$-$\mathcal{S}$ achieves superior detection accuracy while operating substantially faster, in many cases by an order of magnitude, than existing state-of-the-art methods.
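To make the idea of an incrementally updated kernel mean embedding concrete, the following is a minimal sketch and not the $\mathcal{IDK}$-$\mathcal{S}$ algorithm itself. It assumes a fixed-size sliding window and uses random Fourier features for a Gaussian kernel as a stand-in feature map; the Isolation Kernel that the abstract refers to is data-dependent and is not reproduced here. The names `make_feature_map` and `SlidingMeanEmbedding`, and the parameters `gamma`, `n_features`, and `window`, are hypothetical and chosen only for illustration.

```python
import math
import random
from collections import deque


def make_feature_map(dim, n_features, gamma=1.0, seed=0):
    """Random Fourier features approximating exp(-gamma * ||x - y||^2).

    Assumption: a stand-in for illustration only, NOT the Isolation Kernel.
    """
    rng = random.Random(seed)
    scale = math.sqrt(2.0 * gamma)  # std of the Gaussian frequency samples
    weights = [[rng.gauss(0.0, scale) for _ in range(dim)]
               for _ in range(n_features)]
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    norm = math.sqrt(2.0 / n_features)

    def phi(x):
        return [norm * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
                for w, b in zip(weights, offsets)]

    return phi


class SlidingMeanEmbedding:
    """Mean of phi(x) over the last `window` points, updated in O(n_features)."""

    def __init__(self, phi, window):
        self.phi = phi
        self.window = window
        self.buffer = deque()  # feature vectors currently in the window
        self.total = None      # running sum of those feature vectors

    def update(self, x):
        fx = self.phi(x)
        if self.total is None:
            self.total = [0.0] * len(fx)
        self.buffer.append(fx)
        for j, v in enumerate(fx):
            self.total[j] += v
        if len(self.buffer) > self.window:
            # Evict the oldest point instead of recomputing the whole mean.
            old = self.buffer.popleft()
            for j, v in enumerate(old):
                self.total[j] -= v

    def mean(self):
        if not self.buffer:
            raise ValueError("no points observed yet")
        n = len(self.buffer)
        return [t / n for t in self.total]

    def score(self, x):
        """Similarity of x to the windowed distribution; lower is more anomalous."""
        return sum(a * b for a, b in zip(self.phi(x), self.mean()))


if __name__ == "__main__":
    rng = random.Random(1)
    phi = make_feature_map(dim=2, n_features=200)
    emb = SlidingMeanEmbedding(phi, window=50)
    for _ in range(120):
        emb.update([rng.gauss(0.0, 0.1), rng.gauss(0.0, 0.1)])
    print("score near the cluster:", emb.score([0.0, 0.0]))
    print("score far from it:     ", emb.score([5.0, 5.0]))
```

In this sketch each update touches only the arriving and the evicted feature vector, whereas recomputing the mean from scratch costs time proportional to the window size. Up to floating-point rounding the two give the same embedding, which is a much simpler analogue of the equivalence to full retraining claimed in the abstract.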