Current multimodal large language models possess strong perceptual and reasoning capabilities, but their high computational and memory requirements make them difficult to deploy directly in on-device environments. While small-parameter models are progressively acquiring strong general capabilities, the standard Vision Transformer (ViT) encoder remains a critical bottleneck, suffering from excessive latency and memory consumption when processing high-resolution inputs. To address these challenges, we introduce HyperVL, an efficient multimodal large language model tailored for on-device inference. HyperVL adopts an image-tiling strategy to cap peak memory usage and incorporates two novel techniques: (1) a Visual Resolution Compressor (VRC) that adaptively predicts optimal encoding resolutions to eliminate redundant computation, and (2) Dual Consistency Learning (DCL), which aligns multi-scale ViT encoders within a unified framework, enabling dynamic switching between visual branches under a shared LLM. Extensive experiments demonstrate that HyperVL achieves state-of-the-art performance among models of comparable size across multiple benchmarks. Furthermore, it significantly reduces latency and power consumption on real mobile devices, demonstrating its practicality for on-device multimodal inference.
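To make the VRC idea concrete, the following is a minimal sketch of an adaptive resolution predictor: a lightweight network scores a fixed low-resolution thumbnail of an image tile and selects one of several candidate encoding resolutions, so that low-detail tiles avoid the cost of full-resolution ViT patch sequences. All names (VisualResolutionCompressor, the candidate resolutions, the thumbnail size) are illustrative assumptions, not the HyperVL implementation.

```python
# Hypothetical sketch of a Visual Resolution Compressor (VRC); the module
# structure and resolution candidates are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualResolutionCompressor(nn.Module):
    """Predicts an encoding resolution per image tile (illustrative sketch)."""

    def __init__(self, candidate_resolutions=(224, 336, 448)):
        super().__init__()
        self.candidates = candidate_resolutions
        # Tiny CNN over a fixed low-res thumbnail; negligible cost next to a ViT.
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, len(candidate_resolutions))

    def forward(self, image: torch.Tensor) -> int:
        # image: (B, 3, H, W); score candidate resolutions from a thumbnail.
        thumb = F.interpolate(image, size=(64, 64), mode="bilinear",
                              align_corners=False)
        logits = self.classifier(self.features(thumb).flatten(1))
        # Pick the resolution the predictor deems sufficient for this tile.
        return self.candidates[logits.argmax(dim=-1)[0].item()]

# Usage: resize the tile to the predicted resolution before ViT encoding.
vrc = VisualResolutionCompressor()
tile = torch.randn(1, 3, 672, 672)
res = vrc(tile)
tile_for_vit = F.interpolate(tile, size=(res, res), mode="bilinear",
                             align_corners=False)
```

A predictor of this form would be trained to choose the cheapest resolution that preserves downstream task accuracy; combined with image tiling, it bounds both peak memory and per-tile encoding cost.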