Activation functions play a critical role in deep neural networks by shaping gradient flow, optimization stability, and generalization. While ReLU remains widely used due to its simplicity, it suffers from gradient sparsity and dead-neuron issues and offers no adaptivity to input statistics. Smooth alternatives such as Swish and GELU improve gradient propagation but still apply a fixed transformation regardless of the activation distribution. In this paper, we propose VeLU, a Variance-enhanced Learning Unit that introduces variance-aware and distributionally aligned nonlinearity through a principled combination of ArcTan-ArcSin transformations, adaptive scaling, and Wasserstein-2 regularization (optimal transport). This design enables VeLU to modulate its response based on local activation variance, mitigate internal covariate shift at the activation level, and improve training stability without adding learnable parameters or architectural overhead. Extensive experiments across six deep neural network architectures show that VeLU outperforms ReLU, ReLU6, Swish, and GELU on 12 vision benchmarks. The implementation of VeLU is publicly available on GitHub.
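The exact VeLU formulation is given in the paper and its released code; as a rough, hedged sketch of the kind of variance-aware ArcTan-ArcSin gating the abstract describes, one might write the following in PyTorch. The class name `VeLUSketch`, the choice of variance normalization, and the specific gate construction are illustrative assumptions, not the authors' implementation, and the Wasserstein-2 regularization term is omitted here.

```python
import math
import torch
import torch.nn as nn


class VeLUSketch(nn.Module):
    """Minimal sketch of a variance-aware ArcTan-ArcSin activation.

    An illustrative guess at the mechanism described in the abstract,
    not the released VeLU implementation; the Wasserstein-2 (optimal
    transport) regularizer is not modeled here.
    """

    def __init__(self, eps: float = 1e-5):
        super().__init__()
        self.eps = eps  # numerical floor for the variance estimate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Local activation variance over all non-batch dimensions.
        dims = tuple(range(1, x.dim()))
        var = x.var(dim=dims, keepdim=True, unbiased=False)
        # Variance-aware scaling: high-variance inputs are squashed harder,
        # mimicking normalization at the activation level with no learnable parameters.
        z = x * torch.rsqrt(var + self.eps)
        # Bounded smooth gate built from arctan followed by arcsin:
        # atan(z) lies in (-pi/2, pi/2), so (2/pi)*atan(z) is a valid arcsin argument.
        gate = 0.5 * (1.0 + (2.0 / math.pi) * torch.asin((2.0 / math.pi) * torch.atan(z)))
        return x * gate


# Usage: drop-in replacement for an elementwise activation.
x = torch.randn(8, 64, 32, 32)
y = VeLUSketch()(x)  # same shape as x
```

Because the gate depends on the per-sample activation variance rather than on fixed constants, the response adapts to the activation distribution in the way the abstract attributes to VeLU, while keeping the module parameter-free.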