Dense video captioning is a fine-grained video understanding task that involves two sub-problems: localizing distinct events in a long video stream, and generating captions for the localized events. We propose the Joint Event Detection and Description Network (JEDDi-Net), which solves the dense video captioning task in an end-to-end fashion. Our model continuously encodes the input video stream with three-dimensional convolutional layers, proposes variable-length temporal events based on the pooled features, and generates their captions. Proposal features are extracted within each proposal segment via 3D Segment-of-Interest (SoI) pooling over the shared video feature encoding. To explicitly model temporal relationships between visual events and their captions in a single video, we also propose a two-level hierarchical captioning module that keeps track of context. On the large-scale ActivityNet Captions dataset, JEDDi-Net demonstrates improved results as measured by standard metrics. We also present the first dense captioning results on the TACoS-MultiLevel dataset.
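To make the SoI pooling step concrete, here is a minimal sketch, not the authors' released implementation, of how variable-length temporal proposals can be cropped from a shared 3D feature volume and pooled to a fixed size. The function name `soi_pool`, the tensor shapes, and the use of PyTorch's adaptive max pooling are assumptions for exposition only.

```python
# Illustrative sketch of 3D Segment-of-Interest (SoI) pooling: each
# variable-length temporal proposal is cropped from a shared 3D-conv
# feature volume and pooled to a fixed-size representation.
# (Hypothetical helper; shapes and names are assumptions, not the paper's code.)
import torch
import torch.nn.functional as F

def soi_pool(features: torch.Tensor, proposals: torch.Tensor,
             out_t: int = 4, out_h: int = 2, out_w: int = 2) -> torch.Tensor:
    """features:  (C, T, H, W) shared encoding of one video.
    proposals: (N, 2) integer [start, end) indices on the temporal axis.
    Returns (N, C, out_t, out_h, out_w) fixed-size proposal features."""
    pooled = []
    for start, end in proposals.tolist():
        segment = features[:, start:end]  # crop the proposal's temporal span
        # Adaptive max pooling maps a segment of any length to a fixed grid,
        # analogous to RoI pooling but over (time, height, width).
        pooled.append(F.adaptive_max_pool3d(segment, (out_t, out_h, out_w)))
    return torch.stack(pooled)

# Example: a 512-channel encoding of 96 temporal steps, two proposals.
feats = torch.randn(512, 96, 7, 7)
props = torch.tensor([[0, 40], [32, 96]])
print(soi_pool(feats, props).shape)  # torch.Size([2, 512, 4, 2, 2])
```

Because every proposal is reduced to the same fixed grid regardless of its duration, all proposals can share a single downstream captioning head, which is what allows proposal generation and caption generation to be trained jointly end-to-end.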