Dual Memory Aggregation Network for Event-Based Object Detection with Learnable Representation

Event-based cameras are bio-inspired sensors that capture the brightness changes of each pixel asynchronously. Compared with frame-based sensors, event cameras offer microsecond-level latency and high dynamic range, and therefore show great potential for object detection under high-speed motion and poor illumination. Because event streams are sparse and asynchronous, most existing approaches resort to hand-crafted methods to convert event data into a 2D grid representation. However, these representations are sub-optimal for aggregating information from the event stream for object detection. In this work, we propose to learn an event representation optimized for event-based object detection. Specifically, event streams are divided into grids along the x-y-t coordinates for both positive and negative polarity, producing a set of pillars as a 3D tensor representation. To fully exploit the information in event streams, a dual-memory aggregation network (DMANet) is proposed that leverages both long and short memory along the event stream to aggregate effective information for object detection. Long memory is encoded in the hidden state of adaptive convLSTMs, while short memory is modeled by computing the spatial-temporal correlation between event pillars at neighboring time intervals. Extensive experiments on the recently released event-based automotive detection dataset demonstrate the effectiveness of the proposed method.
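
To make the x-y-t gridding step concrete, here is a minimal NumPy sketch that bins events into a tensor indexed by polarity, time bin, and pixel. It is only an illustration under stated assumptions: the function name `events_to_pillar_grid`, the `(x, y, t, p)` column layout, and the use of plain event counts per cell are all hypothetical. In particular, counting is a hand-crafted stand-in, whereas the abstract describes a learned representation built on top of such a grid, so this should not be read as DMANet's actual encoder.

```python
import numpy as np


def events_to_pillar_grid(events, height, width, num_bins):
    """Count events per (polarity, time bin, y, x) cell.

    Assumed (hypothetical) layout: `events` has shape (N, 4) with columns
    x, y, t, p, where p > 0 marks positive polarity and anything else
    is treated as negative polarity.
    """
    events = np.asarray(events, dtype=np.float64).reshape(-1, 4)
    grid = np.zeros((2, num_bins, height, width), dtype=np.float32)
    if events.shape[0] == 0:
        return grid

    x = events[:, 0].astype(np.int64)
    y = events[:, 1].astype(np.int64)
    t = events[:, 2]
    p = (events[:, 3] > 0).astype(np.int64)  # 1 = positive, 0 = negative

    # Map timestamps onto num_bins equal-width intervals; the latest
    # event is clamped into the last bin.
    t0 = t.min()
    span = max(t.max() - t0, 1e-9)
    b = np.minimum(((t - t0) / span * num_bins).astype(np.int64), num_bins - 1)

    # np.add.at accumulates correctly when several events hit the same cell.
    np.add.at(grid, (p, b, y, x), 1.0)
    return grid


# Toy stream of four events on a 3x2 sensor, split into two time bins.
toy_events = [
    [0, 0, 0.0, 1],
    [0, 0, 0.1, 1],
    [1, 2, 0.5, -1],
    [1, 2, 1.0, -1],
]
toy_grid = events_to_pillar_grid(toy_events, height=3, width=2, num_bins=2)
```

Each spatial location `(y, x)` then holds a short column of values across polarity and time, which is one way to picture a "pillar". A learned variant would replace the counting step with trainable features computed from the events that fall into each cell.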