Achieving a balance between lightweight design and high performance remains a significant challenge for speech enhancement (SE) on resource-constrained devices. Existing state-of-the-art methods such as MUSE establish a strong baseline with only 0.51M parameters by introducing a Multi-path Enhanced Taylor (MET) transformer and Deformable Embedding (DE). However, an in-depth analysis reveals that MUSE still suffers from efficiency bottlenecks: the MET module relies on a complex "approximate-compensate" mechanism to mitigate the limitations of its Taylor-expansion-based attention, and the offset computation in deformable embedding adds further overhead. This paper proposes IMSE, a systematically optimized, ultra-lightweight network with two core innovations: 1) Replacing the MET module with Magnitude-Aware Linear Attention (MALA). MALA rectifies the magnitude-neglect problem of linear attention by explicitly preserving the norm of the query vectors in the attention computation, achieving efficient global modeling without an auxiliary compensation branch. 2) Replacing the DE module with Inception Depthwise Convolution (IDConv). IDConv adopts the Inception principle of decomposing a large-kernel operation into efficient parallel branches (a square kernel plus horizontal and vertical strips), capturing spectrogram features with minimal parameter redundancy. Extensive experiments on the VoiceBank+DEMAND dataset show that, relative to the MUSE baseline, IMSE reduces the parameter count by 16.8\% (from 0.513M to 0.427M) while achieving performance comparable to the state of the art on the PESQ metric (3.373). This study sets a new benchmark for the trade-off between model size and speech quality in ultra-lightweight speech enhancement.
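To make the first innovation concrete, the following PyTorch sketch illustrates magnitude-aware linear attention. It is a minimal illustration, assuming an elu+1 feature map and re-injecting each query's norm as a per-token output rescaling; the module name and the exact rescaling rule are assumptions for exposition, not IMSE's published implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MagnitudeAwareLinearAttention(nn.Module):
    """Illustrative sketch (hypothetical module, not the paper's code).

    Standard linear attention replaces softmax(QK^T)V with
    phi(Q)(phi(K)^T V), reducing cost from O(N^2 d) to O(N d^2) but
    discarding each query's magnitude. Here we rescale every output
    token by its query norm to re-inject that information.
    """

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, N, C)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q_norm = q.norm(dim=-1, keepdim=True)        # (B, N, 1): magnitude info
        phi_q = F.elu(q) + 1.0                       # positive feature map
        phi_k = F.elu(k) + 1.0
        # Accumulate K^T V once: (B, d, d), linear in sequence length N.
        kv = torch.einsum("bnd,bne->bde", phi_k, v)
        # Normalizer phi(Q) (sum_n phi(K)): shape (B, N).
        z = torch.einsum("bnd,bd->bn", phi_q, phi_k.sum(dim=1)) + self.eps
        out = torch.einsum("bnd,bde->bne", phi_q, kv) / z.unsqueeze(-1)
        # Illustrative magnitude re-injection (assumed rule, not the paper's).
        out = out * q_norm / (q_norm.mean(dim=1, keepdim=True) + self.eps)
        return self.proj(out)
```

The key property shown is that the `kv` accumulation keeps the computation linear in sequence length while the final rescaling restores the query-norm signal that a plain feature-map attention discards.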
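The second innovation can likewise be sketched as an Inception-style depthwise convolution in the spirit of InceptionNeXt: channels are split into an identity branch, a small square depthwise branch, and two orthogonal strip branches. The kernel sizes and split ratio below follow the InceptionNeXt recipe and are assumptions, not necessarily IMSE's exact configuration. The parameter saving is direct: a k×k depthwise kernel costs k² weights per channel, while the 1×k plus k×1 pair costs 2k (121 vs. 22 for k = 11).

```python
import torch
import torch.nn as nn

class InceptionDWConv2d(nn.Module):
    """Sketch of an Inception-style depthwise convolution (IDConv).

    Channels are split into four parallel groups: identity, a small
    square depthwise conv, and two strip depthwise convs (1 x k and
    k x 1) that approximate a large k x k receptive field at a
    fraction of its parameter cost. Ratios/kernels are assumptions
    taken from the InceptionNeXt recipe.
    """

    def __init__(self, channels: int, square_kernel: int = 3,
                 band_kernel: int = 11, branch_ratio: float = 0.125):
        super().__init__()
        gc = int(channels * branch_ratio)  # channels per conv branch
        self.split_sizes = (channels - 3 * gc, gc, gc, gc)
        self.dwconv_hw = nn.Conv2d(gc, gc, square_kernel,
                                   padding=square_kernel // 2, groups=gc)
        self.dwconv_w = nn.Conv2d(gc, gc, (1, band_kernel),
                                  padding=(0, band_kernel // 2), groups=gc)
        self.dwconv_h = nn.Conv2d(gc, gc, (band_kernel, 1),
                                  padding=(band_kernel // 2, 0), groups=gc)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, F, T)
        x_id, x_hw, x_w, x_h = torch.split(x, self.split_sizes, dim=1)
        return torch.cat(
            (x_id, self.dwconv_hw(x_hw), self.dwconv_w(x_w), self.dwconv_h(x_h)),
            dim=1,
        )
```

On a spectrogram input (B, C, F, T), the horizontal strip attends along time and the vertical strip along frequency, which is why the decomposition suits SE features despite its small parameter budget.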