The quadratic computational complexity of Multi-Head Self-Attention (MHSA) remains a fundamental bottleneck in scaling Large Language Models (LLMs) for long-context tasks. While sparse and linearized attention mechanisms attempt to mitigate this, they often compromise the representation of global dependencies or fail to capture multi-scale semantic granularity effectively. In this paper, we propose Multi-scale Aggregated Hierarchical Attention (MAHA), a novel architectural framework that reformulates the attention mechanism through hierarchical decomposition and mathematically rigorous aggregation. Unlike conventional approaches that treat token interactions at a single resolution, MAHA dynamically partitions the input sequence into hierarchical scales via learnable downsampling operators. The core innovation lies in its aggregation strategy: we model the fusion of scale-specific attention matrices as a resource allocation problem, solved via a convex optimization framework or a Nash equilibrium-based game-theoretic approach. This ensures a theoretically optimal balance between local nuance and global context fidelity. Implemented within a hybrid dilated-convolutional transformer backbone, MAHA utilizes differentiable optimization layers to enable end-to-end training. Experimental evaluations demonstrate that MAHA achieves superior scalability; empirical FLOPs analysis confirms an 81% reduction in computational cost at a sequence length of 4096 compared to standard attention. This work bridges the gap between optimization theory and sequence modeling, offering a scalable solution for next-generation LLMs.
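To make the high-level description above concrete, the following PyTorch sketch illustrates the general shape of a multi-scale attention module with learnable downsampling and a convex (softmax-weighted) fusion of scale-specific outputs. It is not the authors' implementation: the module names (`ScaleBranch`, `MAHASketch`), the strided-convolution downsampler, the single-head attention, and the softmax-weighted fusion standing in for the convex-optimization / Nash-equilibrium aggregation are all illustrative assumptions.

```python
# Minimal, illustrative sketch of multi-scale attention with learnable
# aggregation weights. NOT the paper's implementation: the strided-conv
# downsampler, single-head attention, and softmax-weighted fusion are
# simplifying stand-ins for MAHA's learnable downsampling operators and
# its convex / game-theoretic aggregation of scale-specific attention.
import torch
import torch.nn as nn


class ScaleBranch(nn.Module):
    """One attention branch whose keys/values live at a coarser resolution."""

    def __init__(self, dim: int, stride: int):
        super().__init__()
        # Learnable downsampling operator (assumed: strided 1-D convolution).
        self.down = nn.Conv1d(dim, dim, kernel_size=stride, stride=stride)
        self.qkv = nn.Linear(dim, 3 * dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        dim = x.size(-1)
        # x: (batch, seq_len, dim) -> coarse sequence at 1/stride resolution.
        coarse = self.down(x.transpose(1, 2)).transpose(1, 2)
        q = self.qkv(x)[..., :dim]                       # queries at full resolution
        k, v = self.qkv(coarse)[..., dim:].chunk(2, dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) / dim ** 0.5, dim=-1)
        return attn @ v                                  # (batch, seq_len, dim)


class MAHASketch(nn.Module):
    """Fuses scale-specific outputs with learnable convex weights (softmax),
    a simplified stand-in for the optimization-based aggregation."""

    def __init__(self, dim: int, strides=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList([ScaleBranch(dim, s) for s in strides])
        self.logits = nn.Parameter(torch.zeros(len(strides)))  # aggregation weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        outs = torch.stack([b(x) for b in self.branches], dim=0)  # (S, B, L, D)
        w = torch.softmax(self.logits, dim=0)                     # convex combination
        return torch.einsum("s,sbld->bld", w, outs)


if __name__ == "__main__":
    x = torch.randn(2, 64, 32)                    # (batch, seq_len, dim)
    print(MAHASketch(dim=32)(x).shape)            # torch.Size([2, 64, 32])
```

Because coarse branches attend over sequences shortened by their stride, the per-branch attention cost drops roughly by that factor, which is the intuition behind the reported FLOPs savings at long sequence lengths.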