The effectiveness of self-supervised learning (SSL) for physiological time series depends on the ability of a pretraining objective to preserve information about the underlying physiological state while filtering out unrelated noise. However, existing strategies are limited by their reliance on heuristic principles or poorly constrained generative tasks. To address this limitation, we propose a pretraining framework that exploits the information structure of a dynamical systems generative model across multiple time series. This framework yields our key insight: class identity can be captured efficiently by extracting information about the generative variables tied to the system parameters shared across similar time-series samples, while noise unique to individual samples should be discarded. Building on this insight, we propose PULSE, a cross-reconstruction-based pretraining objective for physiological time-series datasets that explicitly extracts system information while discarding non-transferable, sample-specific information. We establish theory that provides sufficient conditions under which the system information can be recovered, and validate it empirically in a synthetic dynamical-systems experiment. Furthermore, we apply our method to diverse real-world datasets, demonstrating that PULSE learns representations that broadly distinguish semantic classes, increase label efficiency, and improve transfer learning.
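The intuition behind the shared-system / sample-specific split can be illustrated with a minimal toy sketch. This is not the PULSE implementation: the `sample` generator and the FFT-peak `encode` function below are hypothetical stand-ins, where a shared frequency plays the role of the system parameters and phase/noise play the role of sample-specific information.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 200)

def sample(freq):
    # Each draw shares the system parameter `freq`, but has its own
    # random phase and observation noise (sample-specific information).
    phase = rng.uniform(0, 2 * np.pi)
    return np.sin(2 * np.pi * freq * t + phase) + 0.05 * rng.normal(size=t.size)

def encode(x):
    # Toy "encoder": the dominant FFT frequency, a stand-in for a learned
    # representation that keeps only the shared system information.
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(t.size, d=t[1] - t[0])
    return freqs[np.argmax(spectrum)]

# Two samples from the same system: same frequency, different noise/phase.
x_a, x_b = sample(freq=5.0), sample(freq=5.0)

# Cross-reconstruction rationale: because encode(x_a) discards the
# sample-specific phase and noise, it carries exactly the information
# needed to describe x_b's dynamics as well as x_a's.
print(round(encode(x_a)), round(encode(x_b)))  # both recover the shared frequency, 5
```

In this caricature, any objective that rewards predicting `x_b` from a representation of `x_a` is forced to retain the shared frequency and free to discard the per-sample phase and noise, which is the information-structure argument the abstract makes for cross-reconstruction.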