Safe hierarchical reinforcement learning framework for dynamic UAV navigation

doi:10.3785/j.issn.1008-973X.2026.06.011

Journal of Zhejiang University (Engineering Science)

2026, Vol. 60, Issue (6): 1240-1250

Computer Technology

Yiming SHANG, Changping DU*, Rui YANG, Tianrui FANG, Ze’an DU, Yao ZHENG

School of Aeronautics and Astronautics, Zhejiang University, Hangzhou 310027, China


Abstract:

The safe hierarchical intelligent exploration learning (SHIELD) framework was proposed to address unmanned aerial vehicle (UAV) navigation and obstacle avoidance in complex dynamic environments. The framework comprised a four-layer progressive safety-assurance architecture. 1) The reinforcement learning decision-making layer was responsible for global path planning. 2) The expert guidance layer optimized the local path via an improved dynamic window approach. 3) The safety assurance layer combined the artificial potential field method and control barrier functions to provide emergency safety constraints. 4) The primal-dual optimization layer optimized the long-term policy through a flexible optimization mechanism. A dynamic adaptive reward function was designed, in which the reward weights were adaptively adjusted according to environmental complexity and task progress. Results showed that SHIELD achieved a task success rate of 95.7% and a path efficiency of 0.962 in complex dynamic environments, improvements of 48.8% and 30.2%, respectively, over the reinforcement learning baseline algorithm, and average improvements of 55.0% and 36.0%, respectively, over three traditional comparison algorithms. The safety and efficiency of UAV navigation in dynamic environments were effectively enhanced.

Key words: unmanned aerial vehicle (UAV); reinforcement learning; dynamic obstacle avoidance planning; control barrier function; primal-dual optimization; dynamic window approach
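For illustration only, and not taken from the paper, two of the ideas named in the abstract can be sketched in a few lines: a control-barrier-function (CBF) safety filter and a primal-dual multiplier update. The sketch assumes single-integrator kinematics (the command is a velocity), one circular obstacle with known velocity, and the barrier function h = ||p - p_o||^2 - r^2. The function names `cbf_safety_filter` and `dual_ascent_step`, the gain `alpha`, the step size `lr`, and the cost limit are hypothetical choices, not SHIELD's actual design.

```python
def dot(x, y):
    """Inner product of two equal-length sequences."""
    return sum(xi * yi for xi, yi in zip(x, y))


def cbf_safety_filter(u_nom, pos, obs_pos, obs_vel, radius, alpha=1.0):
    """Minimally correct a nominal velocity command so that the barrier
    h = ||pos - obs_pos||^2 - radius^2 satisfies dh/dt + alpha * h >= 0.

    Assumption (not from the paper): single-integrator dynamics, pos_dot = u.
    With one linear constraint a . u >= b, the usual CBF quadratic program
    has a closed-form solution: project u_nom onto the constraint half-space.
    """
    d = [p - o for p, o in zip(pos, obs_pos)]
    h = dot(d, d) - radius ** 2          # barrier value (positive when safe)
    a = [2.0 * di for di in d]           # gradient of h with respect to pos
    b = dot(a, obs_vel) - alpha * h      # constraint reads a . u >= b
    slack = dot(a, u_nom) - b
    if slack >= 0.0:
        return list(u_nom)               # nominal command is already safe
    # Guard against a zero gradient (pos == obs_pos); the command is then unchanged.
    scale = slack / max(dot(a, a), 1e-12)
    return [ui - scale * ai for ui, ai in zip(u_nom, a)]


def dual_ascent_step(lam, avg_cost, cost_limit, lr=0.05):
    """Projected dual ascent on a Lagrange multiplier: lam grows while the
    average safety cost exceeds its limit and decays otherwise, never
    dropping below zero."""
    return max(0.0, lam + lr * (avg_cost - cost_limit))


if __name__ == "__main__":
    # UAV at the origin flying straight at a static obstacle centred at (2, 0).
    u_safe = cbf_safety_filter([1.0, 0.0], [0.0, 0.0], [2.0, 0.0],
                               [0.0, 0.0], radius=1.0)
    print(u_safe)
    print(dual_ascent_step(0.0, avg_cost=0.3, cost_limit=0.1, lr=0.5))
```

In this toy setting the filter reduces the approach speed from 1.0 to 0.75 and leaves a command tangential to the obstacle untouched. The multiplier returned by `dual_ascent_step` would typically weight the safety cost in the policy objective.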

Received: 2025-08-24 Published: 2026-05-06

CLC: V 279

Corresponding author: Changping DU E-mail: 22424059@zju.edu.cn; duchangping@zju.edu.cn

About the author: Yiming SHANG (born 2003), male, master's student, working on UAV path planning. orcid.org/0009-0004-1290-4610. E-mail: 22424059@zju.edu.cn


Cite this article:

Yiming SHANG, Changping DU, Rui YANG, Tianrui FANG, Ze’an DU, Yao ZHENG. Safe hierarchical reinforcement learning framework for dynamic UAV navigation. Journal of Zhejiang University (Engineering Science), 2026, 60(6): 1240-1250.

Link to this article:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2026.06.011 or https://www.zjujournals.com/eng/CN/Y2026/V60/I6/1240

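Fig. 1 Environment model for UAV path planning

Table 1 Threat weights and obstacle-density weights at different levels

Table 2 Weight allocation under adaptive reward shaping

Fig. 2 Overall architecture of SHIELD

Table 3 Main hyperparameters of the SAC algorithm

Fig. 3 Execution flow of the PDO algorithm

Table 4 Main control and environment parameters in the simulation experiments

Fig. 4 Reward function curves

Fig. 5 Cumulative success rate curves

Fig. 6 UAV paths in the full-framework experiment

Table 5 Comparison of success rate and path efficiency between the proposed algorithm and traditional algorithms

Fig. 7 Comparison of UAV paths for the three traditional algorithms

Table 6 Comparison of success rate, path efficiency, average turning angle, and average decision time in the ablation experiments

Table 7 Comparison of success rate and path efficiency under different obstacle configurations

Table 8 Comparison of success rate, path efficiency, and average decision time under different disturbances

Fig. 8 UAV paths under multiple disturbances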

