Journal of ZheJiang University (Engineering Science)  2026, Vol. 60 Issue (4): 723-737    DOI: 10.3785/j.issn.1008-973X.2026.04.005
    
Survey on edge deployment and inference acceleration of multimodal large language models
Siru CHEN, Yuanchao SHU*
College of Control Science and Engineering, Zhejiang University, Hangzhou 310027, China

Abstract  

Significant progress in multimodal large language models (MLLMs) has driven advances in visual question answering, visual understanding, and reasoning, and their potential for deployment on resource-constrained edge devices is increasingly recognized. However, large model sizes and the substantial costs of deployment and inference remain major barriers to practical adoption, and optimizing MLLMs for edge devices has therefore become a critical research direction. A comprehensive survey of recent advances in optimizing MLLMs for edge deployment was presented, along with the associated challenges and development trends. The research evolution of MLLMs on edge devices was reviewed, with particular emphasis on model architecture optimization and inference scheduling strategies. In terms of model architecture optimization, techniques including visual information compression, sparse attention, and mixture-of-experts architectures were analyzed. System-level optimizations involving computation scheduling, hardware adaptation, compilation optimization, and cloud-edge collaboration were investigated as means to improve inference efficiency and energy efficiency. Furthermore, the key challenges of these models in practical applications were discussed across task scenarios categorized by autonomy level, ranging from assistive to collaborative and autonomous types. Finally, current limitations were summarized, and future research directions regarding standardized deployment, efficient computation and storage, and multimodal fusion optimization were outlined.
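Among the model-architecture optimizations named above, visual information compression reduces the number of visual tokens passed from the vision encoder to the language backbone. The following is a minimal, illustrative sketch (not the method of any particular model surveyed here) of pixel-unshuffle-style token merging in PyTorch; the class name, dimensions, and merge factor are assumptions chosen for the example.

```python
# Illustrative only: merge each 2x2 block of ViT patch tokens into one token,
# cutting the visual sequence length by 4x before it enters the LLM.
import torch
import torch.nn as nn


class VisualTokenCompressor(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 2048, merge: int = 2):
        super().__init__()
        self.merge = merge
        # Project the concatenated 2x2 block of patch features into the LLM embedding space.
        self.proj = nn.Linear(vision_dim * merge * merge, llm_dim)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, vision_dim), with num_patches = side * side
        b, n, d = patch_tokens.shape
        side = int(n ** 0.5)
        m = self.merge
        x = patch_tokens.view(b, side, side, d)
        # Group m x m spatial neighbourhoods (pixel-unshuffle style) and concatenate their channels.
        x = x.view(b, side // m, m, side // m, m, d)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (side // m) ** 2, m * m * d)
        return self.proj(x)  # (batch, num_patches / m^2, llm_dim)


if __name__ == "__main__":
    tokens = torch.randn(1, 576, 1024)          # e.g. a 24 x 24 ViT patch grid
    compressed = VisualTokenCompressor()(tokens)
    print(compressed.shape)                     # torch.Size([1, 144, 2048])
```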



Key words: multimodal large language models; edge computing; inference acceleration; model architecture optimization; system-level optimization; edge-cloud collaboration
Received: 26 October 2025      Published: 19 March 2026
CLC:  TP 393  
Fund: National Natural Science Foundation of China (92467301); "Pioneer and Leading Goose + X" R&D Program of Zhejiang Province (2025C01012).
Corresponding Author: Yuanchao SHU     E-mail: siruchen@zju.edu.cn; ycshu@zju.edu.cn
Cite this article:

Siru CHEN,Yuanchao SHU. Survey on edge deployment and inference acceleration of multimodal large language models. Journal of ZheJiang University (Engineering Science), 2026, 60(4): 723-737.

URL:

https://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2026.04.005     OR     https://www.zjujournals.com/eng/Y2026/V60/I4/723


Fig.1 Typical architecture of MLLMs
Model | Input modalities | Output modalities | LLM backbone | Total parameters /10^9 | Date
Gemini Nano[10] | text, image, audio, video | text | Gemini[10] | 1.8/3.25 | 2023.12
MobileVLM[11] | text, image | text | MobileLLaMA[11] | 1.7/3.0 | 2023.12
TinyGPT-V[12] | text, image | text | Phi-2[13] | 2.8 | 2023.12
Vary-toy[14] | text, image | text | Qwen[8] | 1.8 | 2024.01
MobileVLM V2[15] | text, image | text | MobileLLaMA[11] | 1.7/3.0/7.0 | 2024.02
LLaVA-Phi[16] | text, image | text | Phi-2[13] | 3.0 | 2024.02
Cobra[17] | text, image | text | Mamba[18] | 2.8 | 2024.03
Mipha[19] | text, image | text | Phi-2[13] | 3.0 | 2024.03
LLaVA-Gemma[20] | text, image | text | Gemma[21] | 2.0/7.0 | 2024.04
Imp[22] | text, image | text | Phi-2[13] | 3.0 | 2024.05
Bunny[23] | text, image | text | Phi-3[24] | 4.0 | 2024.07
PaliGemma[25] | text, image | text | Gemma[21] | 3.0 | 2024.07
InternVL2[26] | text, image, video | text | InternLM2[27] | 1.0/2.0/4.0/8.0 | 2024.07
MiniCPM-V 2.6[28] | text, image, video | text | MiniCPM[29] | 8.0 | 2024.08
Qwen2-VL[30] | text, image, video | text | Qwen2[31] | 2.0 | 2024.09
GLM-Edge[32] | text, image | text | GLM-4[33] | 1.5/2.0/4.0/5.0 | 2024.11
Ivy-VL[34] | text, image | text | Qwen2.5[35] | 3.0 | 2024.12
InternVL2.5[36] | text, image, video | text | InternLM2[27] | 1.0/2.0/4.0/8.0 | 2024.12
PaliGemma2[37] | text, image | text | Gemma[21] | 3.0 | 2024.12
MiniCPM-o 2.6[38] | text, image, audio, video | text, audio | MiniCPM[29] | 8.0 | 2025.01
Megrez-Omni[39] | text, image, audio | text | LLaMA2[40] | 4.0 | 2025.02
SmolVLM2[41] | text, image | text | SmolLM2[42] | 0.256/0.5/2.2 | 2025.02
Moondream[43] | text, image | text | Phi-1.5[44] | 0.5/2.0 | 2025.03
InternVL3[45] | text, image, video | text | InternLM2[27] | 1.0/2.0/8.0 | 2025.04
Kimi-VL[46] | text, image, video | text | Moonlight[46] | A3 1) | 2025.04
Gemma 3n[47] | text, image, audio, video | text | MatFormer[48] | 5.0 (E2)/8.0 (E4) | 2025.06
BlueLM-2.5[49] | text, image | text | BlueLM[49] | 2.9 | 2025.07
MiniCPM-V 4.5[50] | text, image, video | text | MiniCPM[29] | 8.0 | 2025.09
1) Note: in the total-parameter column, the A and E labels denote, respectively, the activated parameter count of MoE models (e.g., A3) and the memory-equivalent parameter count (e.g., E2), indicating the model's actual computational load and resource footprint on the edge side.
Tab.1 Multimodal large language models on the edge side
Model series | Model | Parameters /10^9
LLaMA | LLaMA[7] | 7.0
LLaMA | LLaMA2[40] | 7.0
LLaMA | LLaMA3.2[67] | 1.0/3.0
Qwen | Qwen[8] | 1.8/7.0
Qwen | Qwen1.5[68] | 0.5/1.8/4.0/7.0
Qwen | Qwen2[31] | 0.5/1.5/7.0
Qwen | Qwen2.5[35] | 0.5/1.5/3.0/7.0
Qwen | Qwen3[69] | 0.6/1.7/4.0/8.0
Vicuna | Vicuna[70] | 7.0
MobileLLaMA | MobileLLaMA[11] | 1.3/3.1
Gemini | Gemini Nano1[10] | 1.8
Gemini | Gemini Nano2[10] | 3.25
Phi | Phi-1[71] | 1.3
Phi | Phi-1.5[44] | 1.3
Phi | Phi-2[13] | 2.7
Phi | Phi-3[24] | 3.8/7.0
InternLM | InternLM2[27] | 1.8
InternLM | InternLM2.5[72] | 7.0
TinyLlama | TinyLlama[53] | 1.1
Tab.2 LLM backbones for edge-side deployment
Fig.2 Summary of typical general optimization strategies
Framework | Description
MLC-LLM[89] | Machine-learning-compilation-based LLM framework that applies optimized hardware acceleration to deliver efficient, high-performance inference across diverse platforms
MNN-M[90] | Module of the MNN framework focused on model optimization and acceleration, especially on mobile and embedded devices, with low resource usage and fast inference
vLLM[91] | Inference framework optimized for large language models, designed to improve inference efficiency and reduce memory usage while supporting acceleration on multiple hardware platforms
llama.cpp[92] | C++ implementation built around the LLaMA model family, with optimized inference performance for efficient inference in low-resource environments and support for multiple hardware acceleration techniques
Tab.3 Comparison of typical end-to-end efficient inference frameworks on the edge
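As a usage illustration for the last entry in Tab.3, the sketch below assumes the llama-cpp-python binding of llama.cpp[92] is installed and that a quantized GGUF model file is already present on the device; the file name and generation settings are hypothetical.

```python
# Minimal on-device text generation with llama.cpp via its Python binding.
from llama_cpp import Llama

# A small context window and few threads keep memory and CPU load modest on edge hardware.
llm = Llama(model_path="qwen2-1.5b-instruct-q4_k_m.gguf",  # hypothetical local 4-bit GGUF file
            n_ctx=1024, n_threads=4)

out = llm("Summarize edge computing in one sentence.", max_tokens=48)
print(out["choices"][0]["text"])
```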
Fig.3 Cloud-edge codesign framework
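Fig.3 depicts the cloud-edge codesign framework at a high level. As an illustration only (not the specific framework in the figure), a common collaboration pattern lets the on-device model answer first and escalates to a cloud model when the edge model's confidence is low; the function names, confidence measure, and threshold below are hypothetical placeholders.

```python
# Illustrative edge-first, cloud-fallback routing policy for collaborative inference.
from dataclasses import dataclass


@dataclass
class Answer:
    text: str
    confidence: float  # e.g. mean token probability mapped to [0, 1]


def edge_generate(prompt: str) -> Answer:
    # Placeholder for a call to an on-device MLLM.
    return Answer(text="edge draft answer", confidence=0.62)


def cloud_generate(prompt: str) -> Answer:
    # Placeholder for a call to a larger cloud-hosted MLLM.
    return Answer(text="cloud refined answer", confidence=0.95)


def collaborative_answer(prompt: str, threshold: float = 0.7) -> Answer:
    draft = edge_generate(prompt)
    # Stay on the edge when the draft is confident enough; otherwise escalate to the cloud.
    return draft if draft.confidence >= threshold else cloud_generate(prompt)


if __name__ == "__main__":
    print(collaborative_answer("What objects are in this image?").text)
```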
[1]   VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C]// Advances in Neural Information Processing Systems. Long Beach: Curran Associates, 2017: 5998−6008.
[2]   ALAYRAC J B, DONAHUE J, LUC P, et al. Flamingo: a visual language model for few-shot learning [C]// Advances in Neural Information Processing Systems. New Orleans: Curran Associates, 2022: 23716−23736.
[3]   YANG Z, LI L, LIN K, et al. The dawn of LMMs: preliminary explorations with GPT-4V(ision) [EB/OL]. (2023−03−04) [2025−10−17]. https://arxiv.org/abs/2303.08774.
[4]   DOSOVITSKIY A. An image is worth 16x16 words: Transformers for image recognition at scale [EB/OL]. (2021−06−04) [2025−10−17]. https://arxiv.org/abs/2010.11929.
[5]   DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding [C]// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics. Minneapolis: Association for Computational Linguistics, 2019: 4171−4186.
[6]   RADFORD A, NARASIMHAN K, SALIMANS T, et al. Improving language understanding by generative pre-training[EB/OL]. (2018−06−09) [2025−10−17]. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf.
[7]   TOUVRON H, LAVRIL T, IZACARD G, et al. Llama: open and efficient foundation language models [EB/OL]. (2023−02−27) [2025−10−17]. https://arxiv.org/abs/2302.13971.
[8]   BAI J, BAI S, CHU Y, et al. Qwen technical report [EB/OL]. (2023−09−28) [2025−10−17]. https://arxiv.org/abs/2309.16609.
[9]   DRIESS D, XIA F, SAJJADI M S, et al. Palm-e: an embodied multimodal language model [EB/OL]. (2023−03−06) [2025−10−17]. https://arxiv.org/abs/2303.03378.
[10]   TEAM G, ANIL R, BORGEAUD S, et al. Gemini: a family of highly capable multimodal models [EB/OL]. (2025−05−09) [2025−10−17]. https://arxiv.org/abs/2312.11805.
[11]   CHU X, QIAO L, LIN X, et al. Mobilevlm: a fast, strong and open vision language assistant for mobile devices [EB/OL]. (2023−12−30) [2025−10−17]. https://arxiv.org/abs/2312.16886.
[12]   YUAN Z, LI Z, HUANG W, et al. Tinygpt-v: efficient multimodal large language model via small backbones [EB/OL]. (2024−01−21) [2025−10−17]. https://arxiv.org/abs/2312.16862.
[13]   JAVAHERIPI M, BUBECK S, ABDIN M, et al. Phi-2: the surprising power of small language models [EB/OL]. (2023−12−12) [2025−10−17]. https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/.
[14]   WEI H, KONG L, CHEN J, et al. Small language model meets with reinforced vision vocabulary [EB/OL]. (2024−01−23) [2025−10−17]. https://arxiv.org/abs/2401.12503.
[15]   CHU X, QIAO L, ZHANG X, et al. Mobilevlm v2: faster and stronger baseline for vision language model [EB/OL]. (2025−02−06) [2025−10−17]. https://arxiv.org/abs/2402.03766.
[16]   ZHU Y, ZHU M, LIU N, et al. Llava-phi: efficient multi-modal assistant with small language model [C]// Proceedings of the 1st International Workshop on Efficient Multimedia Computing under Limited Resources. New York: Association for Computing Machinery, 2024: 18−22.
[17]   ZHAO H, ZHANG M, ZHAO W, et al. Cobra: extending mamba to multi-modal large language model for efficient inference [C]// Proceedings of the AAAI Conference on Artificial Intelligence. Philadelphia: AAAI Press, 2025: 10421−10429.
[18]   GU A, DAO T. Mamba: linear-time sequence modeling with selective state spaces [C]// 1st Conference on Language Modeling. Philadelphia: [s. n. ], 2024.
[19]   ZHU M, ZHU Y, LIU X, et al. Mipha: a comprehensive overhaul of multimodal assistant with small language models [EB/OL]. (2024−03−25) [2025−10−17]. https://arxiv.org/abs/2403.06199.
[20]   HINCK M, OLSON M L, COBBLEY D, et al. Llava-gemma: accelerating multimodal foundation models with a compact language [EB/OL]. (2024−06−10) [2025−10−17]. https://arxiv.org/abs/2404.01331.
[21]   TEAM G, MESNARD T, HARDIN C, et al. Gemma: open models based on gemini research and technology [EB/OL]. (2024−04−16) [2025−10−17]. https://arxiv.org/abs/2403.08295.
[22]   SHAO Z, YU Z, YU J, et al. Imp: highly capable large multimodal models for mobile devices[J]. IEEE Transactions on Multimedia, 2025, 27: 2961−2974
doi: 10.1109/TMM.2025.3557680
[23]   HE M, LIU Y, WU B, et al. Efficient multimodal learning from data-centric perspective [EB/OL]. (2024−07−22) [2025−10−17]. https://arxiv.org/abs/2402.11530.
[24]   ABDIN M, ANEJA J, AWADALLA H, et al. Phi-3 technical report: a highly capable language model locally on your phone [EB/OL]. (2024−08−30) [2025−10−17]. https://arxiv.org/abs/2404.14219.
[25]   BEYER L, STEINER A, PINTO A S, et al. Paligemma: a versatile 3b vlm for transfer [EB/OL]. (2024−10−10) [2025−10−17]. https://arxiv.org/abs/2407.07726.
[26]   CHEN Z, WU J, WANG W, et al. Internvl: scaling up vision foundation models and aligning for generic visual-linguistic tasks [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2024: 24185−24198.
[27]   CAI Z, CAO M, CHEN H, et al. Internlm2 technical report [EB/OL]. (2024−03−26) [2025−10−17]. https://arxiv.org/abs/2403.17297.
[28]   YAO Y, YU T, ZHANG A, et al. MiniCPM-V 2.6: a GPT-4V level MLLM for single image, multi image and video on your phone [EB/OL]. (2024−08−06) [2025−10−17]. https://github.com/nuoan/MiniCPM-V2.6.
[29]   HU S, TU Y, HAN X, et al. Minicpm: unveiling the potential of small language models with scalable training strategies [EB/OL]. (2024−06−03) [2025−10−17]. https://arxiv.org/abs/2404.06395.
[30]   WANG P, BAI S, TAN S, et al. Qwen2-vl: enhancing vision-language model’s perception of the world at any resolution[EB/OL]. (2024−10−03) [2025−10−17]. https://arxiv.org/abs/2409.12191.
[31]   TEAM Q. Qwen2 technical report [EB/OL]. (2024−09−10) [2025−10−17]. https://arxiv.org/abs/2407.10671.
[32]   GLM series edge models [EB/OL]. (2025−06−12) [2025−10−17]. https://github.com/zai-org/GLM-Edge.
[33]   GLM T, ZENG A, XU B, et al. Chatglm: a family of large language models from glm-130b to glm-4 all tools [EB/OL]. (2024−07−30) [2025−10−17]. https://arxiv.org/abs/2406.12793.
[34]   ZHANG I, PENG W, JENNY N, et al. Ivy-VL: compact vision-language models achieving SOTA with optimal data [EB/OL]. (2024−12−01) [2025−10−17]. https://huggingface.co/AI-Safeguard/Ivy-VL-llava.
[35]   QWEN TEAM. Qwen2.5: a party of foundation models [EB/OL]. (2024−09−01) [2025−10−17]. https://qwenlm.github.io/blog/qwen2.5/.
[36]   CHEN Z, WANG W, CAO Y, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling [EB/OL]. (2025−09−26) [2025−10−17]. https://arxiv.org/abs/2412.05271.
[37]   STEINER A, PINTO A S, TSCHANNEN M, et al. Paligemma 2: a family of versatile vlms for transfer [EB/OL]. (2024−12−04) [2025−10−17]. https://arxiv.org/abs/2412.03555.
[38]   YAO Y, YU T, ZHANG A, et al. MiniCPM-o 2.6: a GPT-4o level MLLM for vision, speech and multimodal live streaming on your phone [EB/OL]. (2025−01−24) [2025−10−17]. https://github.com/shaneholloman/minicpm-o.
[39]   LI B, LI Y, LI Z, et al. Megrez-omni technical report [EB/OL]. (2025−02−19) [2025−10−17]. https://arxiv.org/abs/2502.15803.
[40]   TOUVRON H, MARTIN L, STONE K, et al. Llama 2: open foundation and fine-tuned chat models [EB/OL]. (2023−07−19) [2025−10−17]. https://arxiv.org/abs/2307.09288.
[41]   MARAFIOTI A, ZOHAR O, FARRÉ M, et al. Smolvlm: redefining small and efficient multimodal models [EB/OL]. (2025−04−07) [2025−10−17]. https://arxiv.org/abs/2504.05299.
[42]   ALLAL L B, LOZHKOV A, BAKOUCH E, et al. SmolLM2: when smol goes big--data-centric training of a small language model [EB/OL]. (2025−02−24) [2025−10−17]. https://arxiv.org/abs/2502.02737.
[43]   M87 LABS, INC. Moondream[EB/OL]. (2025−03−27) [2025−10−17]. https://moondream.ai/.
[44]   LI Y, BUBECK S, ELDAN R, et al. Textbooks are all you need ii: phi-1.5 technical report [EB/OL]. (2023−09−11) [2025−10−17]. https://arxiv.org/abs/2309.05463.
[45]   ZHU J, WANG W, CHEN Z, et al. Internvl3: exploring advanced training and test-time recipes for open-source multimodal models [EB/OL]. (2025−04−19) [2025−10−17]. https://arxiv.org/abs/2504.10479.
[46]   TEAM K, DU A, YIN B, et al. Kimi-vl technical report [EB/OL]. (2025−06−23) [2025−10−17]. https://arxiv.org/abs/2504.07491.
[47]   TEAM G, KAMATH A, FERRET J, et al. Gemma 3n model overview [EB/OL]. (2025−06−30) [2025−10−17]. https://ai.google.dev/gemma/docs/gemma-3n.
[48]   DEVVRIT F, KUDUGUNTA S, KUSUPATI A, et al. Matformer: nested transformer for elastic inference [C]// Advances in Neural Information Processing Systems. Vancouver: Curran Associates, 2024: 140535−140564.
[49]   XIONG B, CHEN B, WANG C, et al. BlueLM-2.5-3B technical report [EB/OL]. (2025−07−08) [2025−10−17]. https://arxiv.org/abs/2507.05934.
[50]   YU T, WANG Z, WANG C, et al. Minicpm-v 4.5: cooking efficient mllms via architecture, data, and training recipe[EB/OL]. (2025−09−16) [2025−10−17]. https://arxiv.org/abs/2509.18154.
[51]   LI J, LI D, SAVARESE S, et al. Blip-2: bootstrapping language-image pre-training with frozen image encoders and large language models [C]// International Conference on Machine Learning. Hawaii: PMLR, 2023: 19730−19742.
[52]   LIU H, LI C, WU Q, et al. Visual instruction tuning [C]// Advances in Neural Information Processing Systems. New Orleans: Curran Associates, 2023: 34892−34916.
[53]   ZHANG P, ZENG G, WANG T, et al. Tinyllama: an open-source small language model [EB/OL]. (2024−07−04) [2025−10−17]. https://arxiv.org/abs/2401.02385.
[54]   ZHANG S, FANG Q, YANG Z, et al. Llava-mini: efficient image and video large multimodal models with one vision token [EB/OL]. (2025−03−02) [2025−10−17]. https://arxiv.org/abs/2501.03895.
[55]   SHAO K, TAO K, ZHANG K, et al. When tokens talk too much: a survey of multimodal long-context token compression across images, videos, and audios [EB/OL]. (2025−08−28) [2025−10−17]. https://arxiv.org/abs/2507.20198.
[56]   GAO Z, CHEN Z, CUI E, et al. Mini-internvl: a flexible-transfer pocket multi-modal model with 5% parameters and 90% performance[J]. Visual Intelligence, 2024, 2 (1): 32
doi: 10.1007/s44267-024-00067-6
[57]   LI B, ZHANG Y, GUO D, et al. Llava-onevision: easy visual task transfer [EB/OL]. (2024−10−26) [2025−10−17]. https://arxiv.org/abs/2408.03326.
[58]   BAI S, CHEN K, LIU X, et al. Qwen2.5-vl technical report [EB/OL]. (2025−02−19) [2025−10−17]. https://arxiv.org/abs/2502.13923.
[59]   WANG H, YU Z, SPADARO G, et al. Folder: Accelerating multi-modal large language models with enhanced performance [EB/OL]. (2025−04−10) [2025−10−17]. https://arxiv.org/abs/2501.02430.
[60]   SUN B, ZHANG Y, JIANG S, et al. Hybrid pixel-unshuffled network for lightweight image super-resolution [C]// Proceedings of the AAAI Conference on Artificial Intelligence. Washington, DC: AAAI Press, 2023: 2375−2383.
[61]   CHEN L, ZHAO H, LIU T, et al. An image is worth 1/2 tokens after layer 2: plug-and-play inference acceleration for large vision-language models [C]// European Conference on Computer Vision. Milan: Springer Nature Switzerland, 2024: 19−35.
[62]   HAN Y, LIU X, DING P, et al. Rethinking token reduction in mllms: towards a unified paradigm for training-free acceleration [EB/OL]. (2024−12−04) [2025−10−17]. https://arxiv.org/html/2411.17686v2.
[63]   ZHAO Z, LI Y, LI Y. Learning free token reduction for multi-modal large language models [EB/OL]. (2025−09−30) [2025−10−17]. https://arxiv.org/abs/2501.17391.
[64]   HEO B, PARK S, HAN D, et al. Rotary position embedding for vision transformer [C]// European Conference on Computer Vision. Milan: Springer Nature Switzerland, 2024: 289−305.
[65]   LI W, ZHOU H, YU J, et al. Coupled mamba: enhanced multimodal fusion with coupled state space model [C]// Advances in Neural Information Processing Systems. Vancouver: Curran Associates, 2024: 59808−59832.
[66]   HU Y, FAN Z, WANG X, et al. TinyAlign: boosting lightweight vision-language models by mitigating modal alignment bottlenecks [EB/OL]. (2025−06−30) [2025−10−17]. https://arxiv.org/abs/2505.12884.
[67]   META. Llama 3.2: revolutionizing edge AI and vision with open, customizable models [EB/OL]. (2024−09−25) [2025−10−17]. https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/.
[68]   TEAM Q. Qwen1.5 [EB/OL]. (2024−02−05) [2025−10−17]. https://github.com/QwenLM/Qwen1.5.
[69]   YANG A, LI A, YANG B, et al. Qwen3 technical report [EB/OL]. (2025−05−14) [2025−10−17]. https://arxiv.org/abs/2505.09388.
[70]   CHIANG W L, LI Z, LIN Z, et al. Vicuna: an open-source chatbot impressing GPT-4 with 90%* ChatGPT quality [EB/OL]. (2023−04−14) [2025−10−17]. https://vicuna.lmsys.org.
[71]   GUNASEKAR S, ZHANG Y, ANEJA J, et al. Textbooks are all you need [EB/OL]. (2023−10−02) [2025−10−17]. https://arxiv.org/abs/2306.11644.
[72]   WU Z, HUANG S, ZHOU Z, et al. InternLM2.5-StepProver: advancing automated theorem proving via critic-guided search [C]// 2nd AI for Math Workshop @ ICML 2025. Vancouver: PMLR, 2025.
[73]   BELLAGENTE M, TOW J, MAHAN D, et al. Stable LM 2 1.6B technical report [EB/OL]. (2024−02−27) [2025−10−17]. https://arxiv.org/abs/2402.17834.
[74]   MATHEW M, KARATZAS D, JAWAHAR C V. Docvqa: a dataset for vqa on document images [C]// Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. [S. l.]: IEEE, 2021: 2200−2209.
[75]   KAZEMZADEH S, ORDONEZ V, MATTEN M, et al. Referitgame: referring to objects in photographs of natural scenes [C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha: Association for Computational Linguistics, 2014: 787−798.
[76]   TEAM Q. Qwen-VL: a versatile vision-language model for understanding, localization, text reading, and beyond [EB/OL]. (2023−10−13) [2025−10−17]. https://arxiv.org/abs/2308.12966.
[77]   JIANG A Q, SABLAYROLLES A, ROUX A, et al. Mixtral of experts [EB/OL]. (2024−01−08) [2025−10−17]. https://arxiv.org/abs/2401.04088.
[78]   LIN B, TANG Z, YE Y, et al. Moe-llava: mixture of experts for large vision-language models [EB/OL]. (2024−12−23) [2025−10−17]. https://arxiv.org/abs/2401.15947.
[79]   YUE Y, WANG Y, KANG B, et al. Deer-vla: dynamic inference of multimodal large language models for efficient robot execution [C]// Advances in Neural Information Processing Systems. Vancouver: Curran Associates, 2024: 56619−56643.
[80]   FENG Q, LI W, LIN T, et al. Align-KD: distilling cross-modal alignment knowledge for mobile vision-language large model enhancement [C]// Proceedings of the Computer Vision and Pattern Recognition Conference. Nashville: IEEE, 2025: 4178−4188.
[81]   KOSKA B, HORVÁTH M. Towards multi-modal mastery: a 4.5 B parameter truly multi-modal small language model [C]// 2024 2nd International Conference on Foundation and Large Language Models (FLLM). [S. l.]: IEEE, 2024: 587−592.
[82]   LIN H, BAI H, LIU Z, et al. Mope-clip: structured pruning for efficient vision-language models with module-wise pruning error metric [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2024: 27370−27380.
[83]   RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision [C]// International Conference on Machine Learning. [S. l.]: PMLR, 2021: 8748−8763.
[84]   NING Z, ZHAO J, JIN Q, et al. Inf-MLLM: efficient streaming inference of multimodal large language models on a single GPU [EB/OL]. (2024−09−11) [2025−10−17]. https://arxiv.org/abs/2409.09086.
[85]   HAN I, ZHANG Z, WANG Z, et al. CalibQuant: 1-Bit KV cache quantization for multimodal LLMs [EB/OL]. (2025−03−24) [2025−10−17]. https://arxiv.org/abs/2502.14882.
[86]   GAGRANI M, GOEL R, JEON W, et al. On speculative decoding for multimodal large language models [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2024: 8285−8289.
[87]   BAI K, YE L, HUANG R, et al. EdgeMM: multi-core CPU with heterogeneous AI-extension and activation-aware weight pruning for multimodal LLMs at edge [EB/OL]. (2025−05−16) [2025−10−17]. https://arxiv.org/abs/2505.10782.
[88]   Dimensity 9300 NPU [EB/OL]. (2024−12−23) [2025−10−17]. https://www.mediatek.com/products/smartphones/mediatek-dimensity-9300.
[89]   MLC TEAM. MLC LLM [EB/OL]. (2025−10−10) [2025−10−17]. https://github.com/mlc-ai/mlc-llm.
[90]   LV C, NIU C, GU R, et al. Walle: an end-to-end, general-purpose, and large-scale production system for device-cloud collaborative machine learning [C]// 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). Carlsbad: USENIX Association, 2022: 249−265.
[91]   KWON W, LI Z, ZHUANG S, et al. Efficient memory management for large language model serving with PagedAttention [C]// Proceedings of the 29th Symposium on Operating Systems Principles. Koblenz: Association for Computing Machinery, 2023: 611−626.
[92]   GERGANOV G. llama.cpp [EB/OL]. [2025−10−17]. https://github.com/ggml-org/llama.cpp.
[93]   NVIDIA DEVELOPER. Jetson modules, support, ecosystem, and lineup [EB/OL]. [2025−10−17]. https://developer.nvidia.com/embedded/jetson-modules.
[94]   TVM TEAM. Apache TVM [EB/OL]. (2025−10−10) [2025−10−17]. https://tvm.apache.ac.cn/.
[95]   ABID A, ABDALLA A, ABID A, et al. Gradio: Hassle-free sharing and testing of ml models in the wild [EB/OL]. (2019−06−06) [2025−10−17]. https://arxiv.org/abs/1906.02569.
[96]   RJOUB G, ELMEKKI H, ISLAM S, et al. A hybrid swarm intelligence approach for optimizing multimodal large language models deployment in edge-cloud-based federated learning environments[J]. Computer Communications, 2025, 237 (C): 108152
doi: 10.1016/j.comcom.2025.108152
[97]   HU Y, YE D, KANG J, et al. A cloud-edge collaborative architecture for multimodal LLMs-based advanced driver assistance systems in IoT networks[J]. IEEE Internet of Things Journal, 2025, 12 (10): 13208−13221
doi: 10.1109/JIOT.2024.3509628
[98]   HONG W, WANG W, DING M, et al. Cogvlm2: visual language models for image and video understanding [EB/OL]. (2024−08−29) [2025−10−17]. https://arxiv.org/abs/2408.16500.
[99]   ACHIAM J, ADLER S, AGARWAL S, et al. Gpt-4 technical report[EB/OL]. (2024−03−04) [2025−10−17]. https://arxiv.org/abs/2303.08774.
[100]   GAO Z, ZHANG B, LI P, et al. Multi-modal agent tuning: building a VLM-driven agent for efficient tool usage [C]// International Conference on Learning Representations. Singapore, 2025.
[101]   ZIRUI S, YAOHANG L, MENG F, et al. Mmac-copilot: multi-modal agent collaboration operating system copilot [EB/OL]. (2025−03−23) [2025−10−17]. https://arxiv.org/abs/2404.18074.
[102]   ZHANG C, YANG Z, LIU J, et al. Appagent: Multimodal agents as smartphone users [C]// Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems. Yokohama: Association for Computing Machinery, 2025.
[103]   LI Y, ZHANG C, YANG W, et al. Appagent v2: advanced agent for flexible mobile interactions [EB/OL]. (2025−09−17) [2025−10−17]. https://arxiv.org/abs/2408.11824.
[104]   YI B, HU X, CHEN Y, et al. EcoAgent: an efficient edge-cloud collaborative multi-agent framework for mobile automation [EB/OL]. (2025−05−09) [2025−10−17]. https://arxiv.org/abs/2505.05440.
[105]   WANG J, XU H, YE J, et al. Mobile-agent: autonomous multi-modal mobile device agent with visual perception [EB/OL]. (2024−04−18) [2025−10−17]. https://arxiv.org/abs/2401.16158.
[106]   LIU S, ZENG Z, REN T, et al. Grounding DINO: marrying DINO with grounded pre-training for open-set object detection [C]// European Conference on Computer Vision. Milan: Springer Nature Switzerland, 2024: 38−55.
[107]   XU Z, ZHANG Y, XIE E, et al. Drivegpt4: interpretable end-to-end autonomous driving via large language model[J]. IEEE Robotics and Automation Letters, 2024, 9 (10): 8186−8193
doi: 10.1109/LRA.2024.3440097
[108]   XU Z, BAI Y, ZHANG Y, et al. DriveGPT4-V2: harnessing large language model capabilities for enhanced closed-loop autonomous driving [C]// Proceedings of the Computer Vision and Pattern Recognition Conference. Nashville: IEEE, 2025: 17261−17270.
[109]   ZHENG Y, XING Z, ZHANG Q, et al. Planagent: a multi-modal large language agent for closed-loop vehicle motion planning [EB/OL]. (2024−06−04) [2025−10−17]. https://arxiv.org/abs/2406.01587.
[110]   ONG X, DING P, FAN Y, et al. Quart-Online: Latency-Free Multimodal Large Language Model for Quadruped Robot Learning [C]// 2025 IEEE International Conference on Robotics and Automation (ICRA). Atlanta: IEEE, 2025: 9533−9539.
[111]   YAN F, LIU F, ZHENG L, et al. Robomm: all-in-one multimodal large model for robotic manipulation [EB/OL]. (2024−12−10) [2025−10−17]. https://arxiv.org/abs/2412.07215.
[112]   LIU J, LI C, WANG G, et al. Self-corrected multimodal large language model for end-to-end robot manipulation [EB/OL]. (2024−05−27) [2025−10−17]. https://arxiv.org/html/2405.17418v1.
[113]   TLUO G, YANG G, GONG Z, et al. Visual embodied brain: Let multimodal large language models see, think, and control in spaces [EB/OL]. (2025−05−30) [2025−10−17]. https://arxiv.org/abs/2506.00123.
[114]   CHEN J, LIANG H, DU L, et al. OWMM-Agent: open world mobile manipulation with multi-modal agentic data synthesis [EB/OL]. (2025−06−21) [2025−10−17]. https://arxiv.org/abs/2506.04217.