Computer Technology

Survey on edge deployment and inference acceleration of multimodal large language models

Siru CHEN, Yuanchao SHU*

College of Control Science and Engineering, Zhejiang University, Hangzhou 310027, China