Sparse GNN aggregations (CSR SpMM/SDDMM) vary widely in performance with degree skew, feature width, and GPU micro-architecture. We present AutoSAGE, an input-aware CUDA scheduler that selects tiling and thread mapping per input using a lightweight cost estimate refined by on-device micro-probes, with a guardrail that safely falls back to vendor kernels and a persistent cache for deterministic replay. AutoSAGE covers SpMM and SDDMM and composes into a CSR attention pipeline (SDDMM -> row-softmax -> SpMM). On Reddit and OGBN-Products, it matches vendor baselines at bandwidth-bound feature widths and finds gains at small widths; on synthetic sparsity and skew stress tests it achieves up to 4.7x kernel-level speedups. We release CUDA sources, Python bindings, a reproducible harness, and replayable cache logs.
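The CSR attention pipeline the abstract names can be illustrated with a minimal reference sketch. The code below is a NumPy/SciPy model of the three stages (SDDMM at stored positions, row-wise softmax over each row's nonzeros, then SpMM), not AutoSAGE's CUDA implementation; the function name, signature, and shapes are illustrative assumptions.

```python
import numpy as np
import scipy.sparse as sp

def csr_attention(adj: sp.csr_matrix, Q: np.ndarray, K: np.ndarray, V: np.ndarray):
    """Reference CSR attention: SDDMM -> row-softmax -> SpMM.

    adj : (n, n) CSR sparsity pattern (values ignored, only structure used)
    Q, K, V : (n, d) dense feature matrices
    """
    n = adj.shape[0]
    indptr, indices = adj.indptr, adj.indices
    scores = np.empty(adj.nnz)

    # SDDMM: compute Q[i] . K[j] only at the stored (i, j) positions.
    for i in range(n):
        lo, hi = indptr[i], indptr[i + 1]
        scores[lo:hi] = K[indices[lo:hi]] @ Q[i]

    # Row-softmax over each row's nonzeros (max-subtracted for stability).
    for i in range(n):
        lo, hi = indptr[i], indptr[i + 1]
        if lo == hi:
            continue  # empty row: nothing to normalize
        e = np.exp(scores[lo:hi] - scores[lo:hi].max())
        scores[lo:hi] = e / e.sum()

    # SpMM: apply the sparse attention weights to the value matrix.
    attn = sp.csr_matrix((scores, indices, indptr), shape=adj.shape)
    return attn @ V
```

A tuned kernel would fuse these stages and never materialize `scores` in global memory; the sketch keeps them separate to mirror the pipeline as stated.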