Efficient Spike-driven Transformer for High-performance Drone-View Geo-Localization

Traditional drone-view geo-localization (DVGL) methods based on artificial neural networks (ANNs) have achieved remarkable performance. However, ANNs rely on dense computation, which results in high power consumption. In contrast, spiking neural networks (SNNs) benefit from spike-driven computation and inherently consume little power. Despite this, the potential of SNNs for DVGL has yet to be thoroughly investigated. In addition, in representation learning scenarios the inherent sparsity of spike-driven computation causes loss of critical information and makes it difficult to learn long-range dependencies when aligning heterogeneous visual data sources. To address these issues, we propose SpikeViMFormer, the first SNN framework designed for DVGL. In this framework, a lightweight spike-driven transformer backbone extracts coarse-grained features. To mitigate the loss of critical information, we design a spike-driven selective attention (SSA) block, which uses a spike-driven gating mechanism to selectively enhance features and highlight discriminative regions. Furthermore, a spike-driven hybrid state space (SHS) block is introduced to learn long-range dependencies using a hybrid state space. To reduce computational cost, only the backbone is used during inference. To ensure the backbone remains effective, we propose a novel hierarchical re-ranking alignment learning (HRAL) strategy, which refines features via neighborhood re-ranking and maintains cross-batch consistency to directly optimize the backbone. Experimental results demonstrate that SpikeViMFormer outperforms state-of-the-art SNNs and achieves competitive performance compared with advanced ANNs. Our code is available at https://github.com/ISChenawei/SpikeViMFormer
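To make the contrast between dense and spike-driven computation concrete, the following is a minimal sketch in plain Python and not the authors' implementation. It assumes a leaky integrate-and-fire (LIF) neuron with a hard reset, a model the abstract does not specify. The function names `lif_spikes`, `spike_driven_linear`, and `spike_gate` are hypothetical, and the gate is only a generic illustration of masking features with binary spikes, not the SSA block itself.

```python
def lif_spikes(currents, threshold=1.0, decay=0.5):
    """Assumed LIF neuron: turn a sequence of input currents into 0/1 spikes."""
    v = 0.0  # membrane potential
    spikes = []
    for i in currents:
        v = decay * v + i  # leak the old potential, then integrate the input
        if v >= threshold:
            spikes.append(1)
            v = 0.0  # hard reset after firing (an assumption of this sketch)
        else:
            spikes.append(0)
    return spikes


def spike_driven_linear(spikes, weights):
    """Compute spikes @ weights with additions only.

    spikes:  list of 0/1 values, one per input unit
    weights: one row per input unit, each row a list of output weights
    """
    out = [0.0] * len(weights[0])
    for s, row in zip(spikes, weights):
        if s:  # silent inputs are skipped entirely, so no work is done for them
            for j, w in enumerate(row):
                out[j] += w  # accumulate; no multiplication is needed
    return out


def spike_gate(features, gate_spikes):
    """Illustrative gate: keep a feature where the gate fired, zero it otherwise."""
    return [f if g else 0.0 for f, g in zip(features, gate_spikes)]


if __name__ == "__main__":
    spikes = lif_spikes([0.6, 0.6, 0.6])
    print(spikes)
    print(spike_driven_linear([1, 0, 1], [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]))
    print(spike_gate([0.5, 1.5, 2.0], [1, 0, 1]))
```

Because the activations are binary, the linear layer reduces to summing the weight rows of the inputs that fired, which is where the low power consumption comes from. The same sparsity also shows why information can be lost: any input that stays below the threshold contributes nothing downstream.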
