Journal of Zhejiang University (Engineering Science)  2026, Vol. 60 Issue (6): 1240-1250    DOI: 10.3785/j.issn.1008-973X.2026.06.011
Computer Technology
Safe hierarchical reinforcement learning framework for dynamic UAV navigation
Yiming SHANG, Changping DU*, Rui YANG, Tianrui FANG, Ze’an DU, Yao ZHENG
School of Aeronautics and Astronautics, Zhejiang University, Hangzhou 310027, China
Abstract:

To address UAV navigation and obstacle avoidance in complex dynamic environments, the safe hierarchical intelligent exploration learning (SHIELD) framework was proposed. The framework comprised a four-layer progressive safety-assurance architecture. 1) The reinforcement learning decision-making layer was responsible for global path planning. 2) The expert guidance layer optimized the local path via an improved dynamic window approach. 3) The safety assurance layer combined the artificial potential field method with a control barrier function to provide emergency safety constraints. 4) The primal-dual optimization layer optimized the long-term policy through a flexible optimization mechanism. A dynamic adaptive reward function was designed, in which the reward weights were adjusted according to environmental complexity and task progress. Results showed that SHIELD achieved a task success rate of 95.7% and a path efficiency of 0.962 in complex dynamic environments, improvements of 48.8% and 30.2%, respectively, over the reinforcement learning baseline algorithm and average improvements of 55.0% and 36.0%, respectively, over three traditional comparison algorithms. The safety and efficiency of UAV navigation in dynamic environments were effectively enhanced.

Key words: unmanned aerial vehicle (UAV); reinforcement learning; dynamic obstacle avoidance planning; control barrier function; primal-dual optimization; dynamic window approach
Received: 2025-08-24    Published: 2026-05-06
CLC:  V 279
Corresponding author: Changping DU     E-mail: 22424059@zju.edu.cn; duchangping@zju.edu.cn
About the author: Yiming SHANG (born 2003), male, master's student, working on UAV path planning. orcid.org/0009-0004-1290-4610. E-mail: 22424059@zju.edu.cn

Cite this article:

Yiming SHANG, Changping DU, Rui YANG, Tianrui FANG, Ze’an DU, Yao ZHENG. Safe hierarchical reinforcement learning framework for dynamic UAV navigation. Journal of Zhejiang University (Engineering Science), 2026, 60(6): 1240-1250.

Link to this article:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2026.06.011
https://www.zjujournals.com/eng/CN/Y2026/V60/I6/1240

Fig. 1  Environment model for UAV path planning
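
| Level | $W_{\text{thr}}$ (DWA) | $W_{\text{thr}}$ (CBF) | $W_{\text{den}}$ (DWA) | $W_{\text{den}}$ (CBF) |
| --- | --- | --- | --- | --- |
| high | 2.0 | 2.5 | 1.5 | 1.8 |
| medium | 1.5 | 1.8 | 1.0 | 1.0 |
| low | 1.0 | 1.2 | 0.7 | 0.9 |
| none | 0.5 | 0.8 | 0.5 | 0.7 |

Table 1  Threat weights and obstacle-density weights at different levels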
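
| Reward component | Global | Safety | Approach |
| --- | --- | --- | --- |
| Goal | 1.0 | 1.0 | 1.0 |
| Direction | 1.8 | 0.8 | 2.2 |
| Collision | 1.0 | 1.0 | 1.0 |
| Boundary | 1.0 | 1.0 | 1.0 |
| Static | 0.5 | 1.8 | 1.2 |
| Dynamic | 0.6 | 2.2 | 1.0 |
| Deviation | 0.7 | 1.5 | 0.8 |
| Speed | 2.5 | 0.6 | 0.9 |
| Energy | 0.4 | 1.4 | 1.6 |
| Time | 0.6 | 1.2 | 1.6 |
| DWA | 0.6 | 2.8 | 1.8 |
| CBF | 0.8 | 2.0 | 1.6 |

Table 2  Weight allocation under adaptive reward shaping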
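
As a minimal sketch of how the weights in Table 2 might enter the reward, the Python snippet below forms a weighted sum of per-component reward terms under one of the three weight profiles. The function name `shaped_reward`, the dictionary layout, and the assumption that the total reward is a plain weighted sum are illustrative and not taken from the paper. How the framework switches between the Global, Safety, and Approach profiles is not specified on this page, so the profile is left to the caller.

```python
# Weight profiles copied from Table 2; tuple order is (Global, Safety, Approach).
REWARD_WEIGHTS = {
    "Goal":      (1.0, 1.0, 1.0),
    "Direction": (1.8, 0.8, 2.2),
    "Collision": (1.0, 1.0, 1.0),
    "Boundary":  (1.0, 1.0, 1.0),
    "Static":    (0.5, 1.8, 1.2),
    "Dynamic":   (0.6, 2.2, 1.0),
    "Deviation": (0.7, 1.5, 0.8),
    "Speed":     (2.5, 0.6, 0.9),
    "Energy":    (0.4, 1.4, 1.6),
    "Time":      (0.6, 1.2, 1.6),
    "DWA":       (0.6, 2.8, 1.8),
    "CBF":       (0.8, 2.0, 1.6),
}
PROFILES = ("Global", "Safety", "Approach")


def shaped_reward(components, profile):
    """Weighted sum of raw reward terms under the named profile.

    `components` maps a Table 2 component name to its raw (unweighted) value.
    The weighted-sum form is an assumption made for illustration.
    """
    col = PROFILES.index(profile)
    return sum(REWARD_WEIGHTS[name][col] * value
               for name, value in components.items())


# Hypothetical raw terms: a speed bonus and a dynamic-obstacle penalty.
raw_terms = {"Speed": 1.0, "Dynamic": -0.5}
for profile in PROFILES:
    print(profile, round(shaped_reward(raw_terms, profile), 3))
```

With the same raw terms, the Safety profile penalizes proximity to dynamic obstacles more heavily than the Global profile, which favors speed.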
Fig. 2  Overall architecture of SHIELD
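
| Hyperparameter | Value | Hyperparameter | Value |
| --- | --- | --- | --- |
| Discount factor | 0.99 | Experience replay buffer size | 100 000 |
| Learning rate | 0.000 5 | Batch size | 256 |
| Soft update coefficient | 0.01 | Total training steps | 100 000 |

Table 3  Main hyperparameters of the SAC algorithm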
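
The hyperparameters in Table 3 can be collected into a small configuration object, sketched below in Python. The class and field names (`SACConfig`, `tau`, and so on) are hypothetical. The `soft_update` helper shows only the standard Polyak averaging rule that a soft update coefficient conventionally controls in SAC; it is not the authors' training code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SACConfig:
    """Values from Table 3; field names are illustrative."""
    gamma: float = 0.99            # discount factor
    learning_rate: float = 0.0005
    tau: float = 0.01              # soft update coefficient
    buffer_size: int = 100_000     # experience replay buffer size
    batch_size: int = 256
    total_steps: int = 100_000     # total training steps


def soft_update(target, online, tau):
    """Polyak averaging: target <- (1 - tau) * target + tau * online."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target, online)]


cfg = SACConfig()
# Toy parameter vectors standing in for network weights.
new_target = soft_update([0.0, 1.0], [1.0, 1.0], cfg.tau)
print(new_target)
```

With a coefficient of 0.01, the target parameters move 1% of the way toward the online parameters at each update, which keeps the critic targets slowly varying.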
Fig. 3  Execution flow of the PDO algorithm

| Parameter | Value | Parameter | Value | Parameter | Value | Parameter | Value |
| --- | --- | --- | --- | --- | --- | --- | --- |
| $r_{\text{UAV}}$/m | 1.0 | $n_{\text{dyn}}$ | 2 | $W_{\text{AC}}$ | 0.5 | $W_{\text{cbf}}$ | 0.3 |
| $r_{\mathrm{g}}$/m | 3.0 | $W_{\mathrm{g}}$ | 200 | $\Delta t$/s | 0.1 | $W_{0}$ | 0.3 |
| $r_{\min}$/m | 2 | $W_{\text{dir}}$ | 1.8 | $d_{\text{far}}$/m | 70 | $\alpha_{\tan}$ | 0.8 |
| $r_{\max}$/m | 6 | $W_{\text{col}}$ | 20 | $d_{\text{nar}}$/m | 25 | $d_{\text{bnd}}$/m | 0.8 |
| $r_{\text{sen}}$/m | 10 | $W_{\text{bnd}}$ | 20 | $\alpha$ | 0.8 | $\lambda_{\text{smo}}$ | 0.3 |
| $v_{\max}$/(m·s⁻¹) | 5.0 | $W_{\text{sta}}$ | 1.0 | $\beta_{\text{DWA}}$ | 1.2 | $C_{\text{stp}}$ | 2 |
| $\omega_{\max}$/(rad·s⁻¹) | $\pi/2$ | $W_{\text{dyn}}$ | 2.0 | $\gamma$ | 0.15 | $C_{\text{ep}}$ | 15 |
| $v_{\mathrm{o}1}$/(m·s⁻¹) | 1.0 | $W_{\text{dev}}$ | 1.5 | $\delta$ | 0.02 | $\alpha_{\text{stp}}$ | 0.008 |
| $v_{\mathrm{o}2}$/(m·s⁻¹) | 2.5 | $W_{\text{spd}}$ | 1.2 | $\varepsilon$ | 0.6 | $\alpha_{\text{ep}}$ | 0.01 |
| $\theta_{\text{fov}}$/(°) | 120 | $W_{\text{eng}}$ | 1.8 | $d_{\text{TH}}$/m | 20 | $\beta_{\text{PDO}}$ | 0.85 |
| $N_{\max}$ | 8 | $W_{\text{tim}}$ | 0.3 | $v_{\text{TH}}$/(m·s⁻¹) | 0.3 | $\lambda_{\max}$ | 20 |
| $n_{\text{sta}}$ | 6 | $W_{\text{DWA}}$ | 1.5 | $t_{\text{TH}}$/s | 6 | $D_{\mathrm{s}}$/m | 3 |

Table 4  Main control and environment parameters in the simulation experiments
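
Table 4 lists a multiplier bound $\lambda_{\max}$ and a small coefficient $\alpha_{\text{stp}}$, but this page does not define how they are used. Assuming they play the usual roles in a primal-dual (Lagrangian) safe reinforcement learning update, the Python sketch below shows one projected dual-ascent step on a Lagrange multiplier. The function names, the `cost_limit` argument, and the pairing of these symbols with the update are assumptions made for illustration, not the authors' PDO algorithm.

```python
def dual_ascent_step(lam, cost, cost_limit, step_size=0.008, lam_max=20.0):
    """One projected dual-ascent update of a Lagrange multiplier.

    The multiplier grows when the observed safety cost exceeds the limit and
    shrinks otherwise; it is clipped to [0, lam_max]. The defaults reuse the
    Table 4 values of alpha_stp and lambda_max under the assumption above.
    """
    lam = lam + step_size * (cost - cost_limit)
    return min(max(lam, 0.0), lam_max)


def lagrangian_reward(reward, cost, lam):
    """Reward seen by the policy (primal) update: task reward minus weighted cost."""
    return reward - lam * cost


# Hypothetical rollout: safety cost of 5.0 against a limit of 2.0.
lam = dual_ascent_step(0.0, cost=5.0, cost_limit=2.0)
print(lam, lagrangian_reward(1.0, 5.0, lam))
```

The clipping step is what keeps the penalty bounded: a persistent constraint violation raises the multiplier only up to `lam_max`, so the task reward is never entirely swamped.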
Fig. 4  Reward function curves
Fig. 5  Cumulative success rate curves
Fig. 6  UAV path in the full-framework experiment
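
| Algorithm | Sc/% | E |
| --- | --- | --- |
| APF | 51.6 | 0.738 |
| RRT | 62.8 | 0.665 |
| DWA | 75.4 | 0.716 |
| Proposed algorithm | 95.7 | 0.962 |

Table 5  Comparison of success rate and path efficiency between the proposed algorithm and traditional algorithms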
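
The percentage improvements quoted in the abstract can be checked against the tabulated values with a short calculation, sketched here in Python. Two assumptions are made: the improvements are relative gains, and the reinforcement learning baseline corresponds to row 0 of the ablation table (Sc = 64.3%, E = 0.739). The averaging convention used by the authors is not stated on this page, so the sketch simply averages the per-algorithm relative gains.

```python
def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


proposed = (95.7, 0.962)          # (Sc/%, E) of the proposed algorithm, Table 5
traditional = {                   # (Sc/%, E) of the traditional algorithms, Table 5
    "APF": (51.6, 0.738),
    "RRT": (62.8, 0.665),
    "DWA": (75.4, 0.716),
}
rl_baseline = (64.3, 0.739)       # assumed: ablation row 0 is the RL-only baseline

# Gains over the assumed RL baseline.
base_sc = relative_gain(proposed[0], rl_baseline[0])
base_e = relative_gain(proposed[1], rl_baseline[1])

# Mean of the per-algorithm gains over the traditional methods.
sc_gains = [relative_gain(proposed[0], sc) for sc, _ in traditional.values()]
e_gains = [relative_gain(proposed[1], e) for _, e in traditional.values()]
avg_sc = sum(sc_gains) / len(sc_gains)
avg_e = sum(e_gains) / len(e_gains)

print(f"vs RL baseline: {base_sc:.1f}% (Sc), {base_e:.1f}% (E)")
print(f"vs traditional (mean): {avg_sc:.1f}% (Sc), {avg_e:.1f}% (E)")
```

Under these assumptions the baseline gains round to 48.8% and 30.2%, in line with the abstract. The averaged gains come out near 54.9% and 36.5%, close to the quoted 55.0% and 36.0%, which suggests the authors may have rounded or averaged slightly differently.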
Fig. 7  Comparison of UAV paths for the three traditional algorithms
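
| No. | +DWA | +CBF | +PDO | Sc/% | E | $\theta_{\text{avg}}$/(°) | $T_{\text{avg}}$/ms |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 |  |  |  | 64.3 | 0.739 | 21.41 | 1.87 |
| 1 |  |  |  | 78.8 | 0.779 | 9.99 | 2.41 |
| 2 |  |  |  | 85.6 | 0.893 | 17.72 | 2.03 |
| 3 |  |  |  | 75.2 | 0.752 | 13.75 | 1.88 |
| 4 |  |  |  | 90.3 | 0.922 | 12.70 | 2.54 |
| 5 |  |  |  | 83.4 | 0.807 | 8.93 | 2.43 |
| 6 |  |  |  | 89.9 | 0.908 | 14.82 | 2.11 |
| 7 |  |  |  | 95.7 | 0.962 | 9.41 | 2.57 |

Table 6  Comparison of success rate, path efficiency, average turning angle, and average decision time in the ablation experiments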
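
| Obstacle configuration (static + dynamic) | Sc/% | E |
| --- | --- | --- |
| 10+4 | 95.7 | 0.962 |
| 12+5 | 92.3 | 0.931 |
| 15+6 | 89.6 | 0.908 |

Table 7  Comparison of success rate and path efficiency under different obstacle configurations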
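
| Algorithm and condition | Sc/% | E | $T_{\text{avg}}$/ms |
| --- | --- | --- | --- |
| No disturbance | 95.7 | 0.962 | 2.57 |
| Wind gust only | 81.6 | 0.863 | 2.61 |
| Radar noise only | 95.3 | 0.962 | 2.57 |
| GPS error only | 95.6 | 0.961 | 2.57 |
| Multiple disturbances, APF | 31.8 | 0.593 | 1.06 |
| Multiple disturbances, RRT | 38.7 | 0.512 | 2.07 |
| Multiple disturbances, DWA | 55.3 | 0.638 | 2.49 |
| Multiple disturbances, proposed method | 81.4 | 0.861 | 2.86 |

Table 8  Comparison of success rate, path efficiency, and average decision time under different disturbances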
Fig. 8  UAV paths under multiple-disturbance conditions