Sparse GNN aggregations (CSR SpMM/SDDMM) vary widely in performance with degree skew, feature width, and GPU micro-architecture. We present AutoSAGE, an input-aware CUDA scheduler that selects tiling and thread mapping per input using a lightweight cost estimate refined by on-device micro-probes, with a guardrail that safely falls back to vendor kernels and a persistent cache for deterministic replay. AutoSAGE covers SpMM and SDDMM and composes into a CSR attention pipeline (SDDMM -> row-softmax -> SpMM). On Reddit and OGBN-Products, it matches vendor baselines at bandwidth-bound feature widths and finds gains at small widths; on synthetic sparsity and skew stress tests it achieves up to 4.7x kernel-level speedups. We release CUDA sources, Python bindings, a reproducible harness, and replayable cache logs.
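The CSR attention pipeline the abstract names can be illustrated with a minimal reference sketch. The code below is a NumPy/SciPy model of the three stages (SDDMM at stored positions, row-wise softmax over each row's nonzeros, then SpMM), not AutoSAGE's CUDA implementation; the function name, signature, and shapes are illustrative assumptions.

```python
import numpy as np
import scipy.sparse as sp

def csr_attention(adj: sp.csr_matrix, Q: np.ndarray, K: np.ndarray, V: np.ndarray):
    """Reference CSR attention: SDDMM -> row-softmax -> SpMM.

    adj : (n, n) CSR sparsity pattern (values ignored, only structure used)
    Q, K, V : (n, d) dense feature matrices
    """
    n = adj.shape[0]
    indptr, indices = adj.indptr, adj.indices
    scores = np.empty(adj.nnz)

    # SDDMM: compute Q[i] . K[j] only at the stored (i, j) positions.
    for i in range(n):
        lo, hi = indptr[i], indptr[i + 1]
        scores[lo:hi] = K[indices[lo:hi]] @ Q[i]

    # Row-softmax over each row's nonzeros (max-subtracted for stability).
    for i in range(n):
        lo, hi = indptr[i], indptr[i + 1]
        if lo == hi:
            continue  # empty row: nothing to normalize
        e = np.exp(scores[lo:hi] - scores[lo:hi].max())
        scores[lo:hi] = e / e.sum()

    # SpMM: apply the sparse attention weights to the value matrix.
    attn = sp.csr_matrix((scores, indices, indptr), shape=adj.shape)
    return attn @ V
```

A tuned kernel would fuse these stages and never materialize `scores` in global memory; the sketch keeps them separate to mirror the pipeline as stated.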