Existing implicit neural representation (INR) methods do not fully exploit the spatiotemporal redundancy in videos. Index-based INRs ignore content-specific spatial features, and hybrid INRs ignore the contextual dependency on adjacent frames, leading to poor modeling of scenes with large motion or dynamics. We analyze this limitation from the perspective of function fitting and reveal the importance of frame differences. To exploit explicit motion information, we propose the Difference Neural Representation for Videos (DNeRV), which consists of two streams: one for frame content and one for frame differences. We also introduce a collaborative content unit for effective feature fusion. We evaluate DNeRV on video compression, inpainting, and interpolation: it achieves results competitive with state-of-the-art neural compression approaches and outperforms existing implicit methods on downstream inpainting and interpolation for $960 \times 1920$ videos.
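To make the two-stream design concrete, below is a minimal PyTorch sketch. The module names (`DNeRVSketch`, `CollabContentUnit`), layer widths, and the specific fusion layers are illustrative assumptions, not the paper's exact architecture; in particular, the collaborative content unit here is approximated by a simple convolutional gate rather than the paper's GRU-style unit.

```python
import torch
import torch.nn as nn

class CollabContentUnit(nn.Module):
    """Hypothetical gated fusion of content and difference features.
    DNeRV's actual CCU is GRU-like; this is a simplified stand-in."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.proj = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, content_feat, diff_feat):
        fused = torch.cat([content_feat, diff_feat], dim=1)
        g = torch.sigmoid(self.gate(fused))   # per-pixel mixing gate
        h = torch.tanh(self.proj(fused))      # candidate fused features
        # The gate decides how much difference-driven update to mix
        # into the content stream.
        return g * h + (1 - g) * content_feat


class DNeRVSketch(nn.Module):
    """Two-stream hybrid INR sketch: one encoder embeds the current
    frame (content), another embeds its backward/forward frame
    differences (motion); a decoder reconstructs the frame."""
    def __init__(self, channels=64):
        super().__init__()
        self.content_enc = nn.Sequential(
            nn.Conv2d(3, channels, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.GELU(),
        )
        # Difference stream sees 6 channels: backward and forward diffs.
        self.diff_enc = nn.Sequential(
            nn.Conv2d(6, channels, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.GELU(),
        )
        self.fuse = CollabContentUnit(channels)
        # Upsampling decoder (PixelShuffle, as in NeRV-style blocks).
        self.dec = nn.Sequential(
            nn.Conv2d(channels, 4 * channels, 3, padding=1),
            nn.PixelShuffle(2), nn.GELU(),
            nn.Conv2d(channels, 4 * 3, 3, padding=1),
            nn.PixelShuffle(2),
        )

    def forward(self, frame, prev_frame, next_frame):
        diffs = torch.cat([frame - prev_frame, next_frame - frame], dim=1)
        c = self.content_enc(frame)
        d = self.diff_enc(diffs)
        return self.dec(self.fuse(c, d))


# Usage: reconstruct frame t from itself and its two neighbors.
x_prev, x_t, x_next = (torch.rand(1, 3, 128, 128) for _ in range(3))
model = DNeRVSketch()
recon = model(x_t, x_prev, x_next)
print(recon.shape)  # torch.Size([1, 3, 128, 128])
```

The key design point the abstract highlights is that motion enters explicitly, as frame differences fed to a dedicated stream, instead of being left implicit in a single per-frame embedding; the fusion unit then lets difference features modulate the content features before decoding.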