Loss functions are fundamental to learning accurate 3D point cloud models, yet common choices trade geometric fidelity for computational cost. Chamfer Distance is efficient but permits many-to-one correspondences, while Earth Mover's Distance better reflects one-to-one transport but is computationally expensive. APML approximates transport with differentiable Sinkhorn iterations and an analytically derived temperature, but its dense formulation scales quadratically in memory. We present CUDA-APML, a sparse GPU implementation that thresholds negligible assignments and runs adaptive softmax, bidirectional symmetrization, and Sinkhorn normalization directly in COO form. This yields near-linear memory scaling and preserves gradients on the stored support, while pairwise distance evaluation remains quadratic in the current implementation. On ShapeNet and MM-Fi, CUDA-APML matches dense APML within a small tolerance while reducing peak GPU memory by 99.9%. Code is available at: https://github.com/Multimodal-Sensing-Lab/apml
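The sparse pipeline described above (temperature-scaled softmax over pairwise distances, thresholding of negligible assignments, and Sinkhorn normalization restricted to the stored COO support) can be sketched in NumPy. This is a minimal illustrative sketch, not the paper's implementation: the function name, default threshold, temperature, and normalization schedule are assumptions, and the dense distance matrix mirrors the quadratic distance step noted in the abstract.

```python
import numpy as np

def sparse_sinkhorn_coo(P, Q, temperature=0.1, threshold=1e-3, iters=20):
    """Illustrative sparse Sinkhorn on a thresholded COO support.

    P: (n, 3) predicted points, Q: (m, 3) target points.
    Returns the COO support (rows, cols, vals) and a transport-style loss.
    Hypothetical simplification of CUDA-APML, not its actual API.
    """
    n, m = len(P), len(Q)
    # Dense pairwise squared distances (this step remains quadratic
    # in the current implementation, per the abstract).
    D = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(-1)
    # Temperature-scaled, numerically stabilized softmax over each row.
    K = np.exp(-(D - D.min(axis=1, keepdims=True)) / temperature)
    K /= K.sum(axis=1, keepdims=True)
    # Threshold negligible assignments; keep only the sparse COO support.
    rows, cols = np.nonzero(K > threshold)
    vals = K[rows, cols]
    # Sinkhorn iterations restricted to the stored support:
    # alternate row marginals -> 1/n, column marginals -> 1/m.
    eps = 1e-12
    for _ in range(iters):
        r = np.bincount(rows, weights=vals, minlength=n)
        vals *= (1.0 / n) / (r[rows] + eps)
        c = np.bincount(cols, weights=vals, minlength=m)
        vals *= (1.0 / m) / (c[cols] + eps)
    # Transport-style loss over the sparse support only.
    loss = float((vals * D[rows, cols]).sum())
    return rows, cols, vals, loss
```

In this sketch, memory after thresholding scales with the number of retained assignments rather than n·m, which is the mechanism behind the near-linear memory scaling claimed above.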