Journal of ZheJiang University (Engineering Science)  2026, Vol. 60 Issue (6): 1240-1250    DOI: 10.3785/j.issn.1008-973X.2026.06.011
    
Safe hierarchical reinforcement learning framework for dynamic UAV navigation
Yiming SHANG, Changping DU*, Rui YANG, Tianrui FANG, Ze’an DU, Yao ZHENG
School of Aeronautics and Astronautics, Zhejiang University, Hangzhou 310027, China

Abstract  

The safe hierarchical intelligent exploration learning (SHIELD) framework was proposed to address UAV navigation and obstacle avoidance in complex dynamic environments. The framework comprised a four-layer progressive safety-assurance architecture. 1) The reinforcement learning decision-making layer was responsible for global path planning. 2) The expert guidance layer optimized the local path via an improved dynamic window approach. 3) The safety assurance layer combined the artificial potential field method with a control barrier function to provide emergency safety constraints. 4) The primal-dual optimization layer optimized the long-term policy through a flexible optimization mechanism. A dynamic adaptive reward function was designed, in which the reward weights were adjusted according to environmental complexity and task progress. Results showed that SHIELD achieved a task success rate of 95.7% and a path efficiency of 0.962 in complex dynamic environments, representing improvements of 48.8% and 30.2% over the reinforcement learning baseline algorithm, and average improvements of 55.0% and 36.0% over the three traditional algorithms used for comparison. The safety and efficiency of UAV navigation in dynamic environments were effectively enhanced.
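The safety assurance layer is described only at a high level in the abstract. The following is a minimal sketch of how a discrete-time control barrier function condition can filter a velocity command. It assumes a single circular obstacle, a single-integrator motion model, and a simple back-off search in place of the quadratic program that CBF filters commonly use. The function names are hypothetical, and the default values (safe distance 3 m, γ = 0.15, Δt = 0.1 s) are borrowed from Tab.4; mapping them to these roles is an assumption, not something the page confirms.

```python
def barrier(pos, obstacle, safe_dist):
    """h(x) = ||p - p_obs||^2 - d_safe^2; h >= 0 means the state is safe."""
    dx, dy = pos[0] - obstacle[0], pos[1] - obstacle[1]
    return dx * dx + dy * dy - safe_dist ** 2


def cbf_filter(pos, vel_cmd, obstacle, safe_dist=3.0, gamma=0.15, dt=0.1, steps=10):
    """Scale vel_cmd by the largest factor in {1.0, 0.9, ..., 0.0} that keeps
    the discrete-time CBF condition h(x_next) >= (1 - gamma) * h(x).
    Assumed single-integrator model: x_next = x + v * dt.
    """
    h_now = barrier(pos, obstacle, safe_dist)
    for i in range(steps, -1, -1):
        s = i / steps
        nxt = (pos[0] + s * vel_cmd[0] * dt, pos[1] + s * vel_cmd[1] * dt)
        if barrier(nxt, obstacle, safe_dist) >= (1.0 - gamma) * h_now:
            return (s * vel_cmd[0], s * vel_cmd[1])
    # No scaling satisfies the condition (the state is already unsafe): stop.
    return (0.0, 0.0)
```

A command that moves away from the obstacle passes through unchanged, while a command that approaches it too quickly is slowed until the barrier value decays no faster than the factor (1 − γ) per step.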



Key words: unmanned aerial vehicle (UAV), reinforcement learning, dynamic obstacle avoidance planning, control barrier function, primal-dual optimization, dynamic window approach
Received: 24 August 2025      Published: 06 May 2026
CLC: V 279; TP 18
Corresponding Authors: Changping DU     E-mail: 22424059@zju.edu.cn;duchangping@zju.edu.cn
Cite this article:

Yiming SHANG,Changping DU,Rui YANG,Tianrui FANG,Ze’an DU,Yao ZHENG. Safe hierarchical reinforcement learning framework for dynamic UAV navigation. Journal of ZheJiang University (Engineering Science), 2026, 60(6): 1240-1250.

URL:

https://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2026.06.011     OR     https://www.zjujournals.com/eng/Y2026/V60/I6/1240


Fig.1 Environmental model for UAV path planning
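| Level | $ {W}_{\text{thr}} $ (DWA) | $ {W}_{\text{thr}} $ (CBF) | $ {W}_{\text{den}} $ (DWA) | $ {W}_{\text{den}} $ (CBF) |
| --- | --- | --- | --- | --- |
| high | 2.0 | 2.5 | 1.5 | 1.8 |
| medium | 1.5 | 1.8 | 1.0 | 1.0 |
| low | 1.0 | 1.2 | 0.7 | 0.9 |
| none | 0.5 | 0.8 | 0.5 | 0.7 |

Tab.1 Threat and obstacle density weights under different levels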
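| Reward component | Global | Safety | Approach |
| --- | --- | --- | --- |
| Goal | 1.0 | 1.0 | 1.0 |
| Direction | 1.8 | 0.8 | 2.2 |
| Collision | 1.0 | 1.0 | 1.0 |
| Boundary | 1.0 | 1.0 | 1.0 |
| Static | 0.5 | 1.8 | 1.2 |
| Dynamic | 0.6 | 2.2 | 1.0 |
| Deviation | 0.7 | 1.5 | 0.8 |
| Speed | 2.5 | 0.6 | 0.9 |
| Energy | 0.4 | 1.4 | 1.6 |
| Time | 0.6 | 1.2 | 1.6 |
| DWA | 0.6 | 2.8 | 1.8 |
| CBF | 0.8 | 2.0 | 1.6 |

Tab.2 Weight distribution under adaptive reward shaping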
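To illustrate how a weight profile such as those in Tab.2 might be applied, the sketch below multiplies raw reward components by the weights of a chosen profile and sums them. The weights are copied from Tab.2. The function name, the multiplicative combination, and the idea that the profile is passed in by the caller are assumptions, since the page does not state how the profile is selected or how the weights enter the reward.

```python
# Weights copied from Tab.2; column order is (Global, Safety, Approach).
PROFILE_NAMES = ("Global", "Safety", "Approach")
WEIGHT_ROWS = {
    "Goal": (1.0, 1.0, 1.0),
    "Direction": (1.8, 0.8, 2.2),
    "Collision": (1.0, 1.0, 1.0),
    "Boundary": (1.0, 1.0, 1.0),
    "Static": (0.5, 1.8, 1.2),
    "Dynamic": (0.6, 2.2, 1.0),
    "Deviation": (0.7, 1.5, 0.8),
    "Speed": (2.5, 0.6, 0.9),
    "Energy": (0.4, 1.4, 1.6),
    "Time": (0.6, 1.2, 1.6),
    "DWA": (0.6, 2.8, 1.8),
    "CBF": (0.8, 2.0, 1.6),
}


def shaped_reward(components, profile):
    """Weighted sum of raw reward components under one Tab.2 profile.

    `components` maps a component name to its raw (unweighted) value;
    components that are absent contribute nothing. Hypothetical helper.
    """
    col = PROFILE_NAMES.index(profile)
    return sum(WEIGHT_ROWS[name][col] * value for name, value in components.items())
```

For example, a raw direction reward of 1.0 combined with a raw dynamic-obstacle penalty of −1.0 gives a positive total under the Global and Approach profiles but a negative total under the Safety profile, which weights the dynamic-obstacle term most heavily.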
Fig.2 Overall architecture of SHIELD
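| Hyperparameter | Value | Hyperparameter | Value |
| --- | --- | --- | --- |
| Discount factor | 0.99 | Experience replay buffer size | 100 000 |
| Learning rate | 0.000 5 | Batch size | 256 |
| Soft update coefficient | 0.01 | Total training steps | 100 000 |

Tab.3 Main hyperparameters of SAC algorithm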
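The hyperparameters in Tab.3 can be collected into a small configuration object. The sketch below also shows the standard soft (Polyak) target update that the soft update coefficient controls, written over plain lists of floats rather than network tensors. The class and function names are hypothetical and are not taken from the authors' code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SACConfig:
    """Values copied from Tab.3; field names are illustrative."""
    gamma: float = 0.99           # discount factor
    learning_rate: float = 5e-4   # learning rate
    tau: float = 0.01             # soft update coefficient
    buffer_size: int = 100_000    # experience replay buffer size
    batch_size: int = 256         # batch size
    total_steps: int = 100_000    # total training steps


def soft_update(target, online, tau):
    """Polyak averaging: target <- (1 - tau) * target + tau * online."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target, online)]
```

With a coefficient of 0.01, each update moves the target parameters 1% of the way toward the online parameters, so the target network changes slowly relative to the network being trained.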
Fig.3 Execution process of PDO algorithm
| Parameter | Value | Parameter | Value | Parameter | Value | Parameter | Value |
| --- | --- | --- | --- | --- | --- | --- | --- |
| $ {r}_{\text{UAV}} $/m | 1.0 | $ {n}_{\text{dyn}} $ | 2 | $ {W}_{\text{AC}} $ | 0.5 | $ {W}_{\text{cbf}} $ | 0.3 |
| $ {r}_{\mathrm{g}} $/m | 3.0 | $ {W}_{\mathrm{g}} $ | 200 | $ \Delta t $/s | 0.1 | $ {W}_{0} $ | 0.3 |
| $ {r}_{\min } $/m | 2 | $ {W}_{\text{dir}} $ | 1.8 | $ {d}_{\text{far}} $/m | 70 | $ {\alpha }_{\tan } $ | 0.8 |
| $ {r}_{\max } $/m | 6 | $ {W}_{\text{col}} $ | 20 | $ {d}_{\text{nar}} $/m | 25 | $ {d}_{\text{bnd}} $/m | 0.8 |
| $ {r}_{\text{sen}} $/m | 10 | $ {W}_{\text{bnd}} $ | 20 | $ \alpha $ | 0.8 | $ {\lambda }_{\text{smo}} $ | 0.3 |
| $ {v}_{\max } $/(m·s⁻¹) | 5.0 | $ {W}_{\text{sta}} $ | 1.0 | $ {\beta }_{\text{DWA}} $ | 1.2 | $ {C}_{\text{stp}} $ | 2 |
| $ {\omega }_{\max } $/(rad·s⁻¹) | $ \text{π} /2 $ | $ {W}_{\text{dyn}} $ | 2.0 | $ \gamma $ | 0.15 | $ {C}_{\text{ep}} $ | 15 |
| $ {v}_{\mathrm{o}1} $/(m·s⁻¹) | 1.0 | $ {W}_{\text{dev}} $ | 1.5 | $ \delta $ | 0.02 | $ {\alpha }_{\text{stp}} $ | 0.008 |
| $ {v}_{\mathrm{o}2} $/(m·s⁻¹) | 2.5 | $ {W}_{\text{spd}} $ | 1.2 | $ \varepsilon $ | 0.6 | $ {\alpha }_{\text{ep}} $ | 0.01 |
| $ {\theta }_{\text{fov}} $/(°) | 120 | $ {W}_{\text{eng}} $ | 1.8 | $ {d}_{\text{TH}} $/m | 20 | $ {\beta }_{\text{PDO}} $ | 0.85 |
| $ {N}_{\max } $ | 8 | $ {W}_{\text{tim}} $ | 0.3 | $ {v}_{\text{TH}} $/(m·s⁻¹) | 0.3 | $ {\lambda }_{\max } $ | 20 |
| $ {n}_{\text{sta}} $ | 6 | $ {W}_{\text{DWA}} $ | 1.5 | $ {t}_{\text{TH}} $/s | 6 | $ {D}_{\mathrm{s}} $/m | 3 |

Tab.4 Main control parameters and environmental parameters in simulation experiment
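Primal-dual optimization is usually built around a Lagrange multiplier that grows while a safety cost exceeds its limit and shrinks otherwise. The sketch below shows a generic projected dual-ascent step of that kind. It is not the authors' PDO procedure, whose details are not given on this page. The function names and the step size are hypothetical, and using 20 as the multiplier cap assumes that $ {\lambda }_{\max } $ in Tab.4 plays that role.

```python
def dual_update(lmbda, cost, limit, step=0.01, lmbda_max=20.0):
    """Projected dual ascent on a Lagrange multiplier.

    The multiplier rises when the observed cost exceeds the limit and
    falls when it is below; the result is clipped to [0, lmbda_max].
    """
    return min(max(lmbda + step * (cost - limit), 0.0), lmbda_max)


def lagrangian_reward(reward, cost, lmbda):
    """Reward seen by the policy after penalizing the safety cost."""
    return reward - lmbda * cost
```

The policy (primal) update then maximizes the penalized reward, while the multiplier (dual) update tightens or relaxes the penalty depending on how well the constraint is being met.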
Fig.4 Reward function curve
Fig.5 Cumulative success rate curve
Fig.6 Path of UAV in complete framework experiment
| Algorithm | Sc/% | E |
| --- | --- | --- |
| APF | 51.6 | 0.738 |
| RRT | 62.8 | 0.665 |
| DWA | 75.4 | 0.716 |
| Proposed algorithm | 95.7 | 0.962 |

Tab.5 Comparison of success rate and path efficiency between proposed algorithm and traditional algorithms
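The average improvement over the three traditional algorithms can be checked against the Tab.5 values. The sketch below averages the per-algorithm relative gains, which gives roughly 54.9% for success rate and 36.5% for path efficiency. These are close to the 55.0% and 36.0% quoted in the abstract; the exact averaging the authors used is not stated, so this formula is an assumption and the helper names are hypothetical.

```python
# (Sc/%, E) copied from Tab.5.
TRADITIONAL = {
    "APF": (51.6, 0.738),
    "RRT": (62.8, 0.665),
    "DWA": (75.4, 0.716),
}
PROPOSED = (95.7, 0.962)


def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


def mean_gain(index):
    """Mean per-algorithm relative gain; index 0 is Sc, index 1 is E."""
    gains = [relative_gain(PROPOSED[index], vals[index]) for vals in TRADITIONAL.values()]
    return sum(gains) / len(gains)
```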
Fig.7 UAV path comparison of three traditional algorithms
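| No. | +DWA | +CBF | +PDO | Sc/% | E | $ {\theta }_{\text{avg}} $/(°) | $ {T}_{\text{avg}} $/ms |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | | | | 64.3 | 0.739 | 21.41 | 1.87 |
| 1 | | | | 78.8 | 0.779 | 9.99 | 2.41 |
| 2 | | | | 85.6 | 0.893 | 17.72 | 2.03 |
| 3 | | | | 75.2 | 0.752 | 13.75 | 1.88 |
| 4 | | | | 90.3 | 0.922 | 12.70 | 2.54 |
| 5 | | | | 83.4 | 0.807 | 8.93 | 2.43 |
| 6 | | | | 89.9 | 0.908 | 14.82 | 2.11 |
| 7 | | | | 95.7 | 0.962 | 9.41 | 2.57 |

Tab.6 Comparison of success rate, path efficiency, average turning angle, and average decision time in ablation study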
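The ablation rows can be compared against row 0 with a few lines of arithmetic. The sketch below assumes that row 0 is the reinforcement learning baseline and row 7 the complete framework; the module marks for each row are not visible on this page, but under that assumption the gains for row 7 come out at about 48.8% and 30.2%, matching the abstract. The helper name is hypothetical.

```python
# (Sc/%, E, theta_avg/deg, T_avg/ms) per row, copied from Tab.6.
ABLATION = {
    0: (64.3, 0.739, 21.41, 1.87),
    1: (78.8, 0.779, 9.99, 2.41),
    2: (85.6, 0.893, 17.72, 2.03),
    3: (75.2, 0.752, 13.75, 1.88),
    4: (90.3, 0.922, 12.70, 2.54),
    5: (83.4, 0.807, 8.93, 2.43),
    6: (89.9, 0.908, 14.82, 2.11),
    7: (95.7, 0.962, 9.41, 2.57),
}


def compare_to_baseline(row, baseline=0):
    """Relative gains (percent) and extra decision time (ms) versus the baseline row."""
    sc, eff, _, t = ABLATION[row]
    sc0, eff0, _, t0 = ABLATION[baseline]
    return {
        "success_gain_pct": (sc - sc0) / sc0 * 100.0,
        "efficiency_gain_pct": (eff - eff0) / eff0 * 100.0,
        "extra_decision_time_ms": t - t0,
    }
```

Under the same assumption, the complete framework adds about 0.70 ms of average decision time over row 0.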
| Obstacle configuration (static + dynamic) | Sc/% | E |
| --- | --- | --- |
| 10+4 | 95.7 | 0.962 |
| 12+5 | 92.3 | 0.931 |
| 15+6 | 89.6 | 0.908 |

Tab.7 Comparison of success rate and path efficiency under different obstacle configurations
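| Algorithm and condition | Sc/% | E | $ {T}_{\text{avg}} $/ms |
| --- | --- | --- | --- |
| No interference | 95.7 | 0.962 | 2.57 |
| Wind gust only | 81.6 | 0.863 | 2.61 |
| Radar noise only | 95.3 | 0.962 | 2.57 |
| GPS error only | 95.6 | 0.961 | 2.57 |
| Multiple interference, APF | 31.8 | 0.593 | 1.06 |
| Multiple interference, RRT | 38.7 | 0.512 | 2.07 |
| Multiple interference, DWA | 55.3 | 0.638 | 2.49 |
| Multiple interference, proposed method | 81.4 | 0.861 | 2.86 |

Tab.8 Comparison of success rate, path efficiency, and average decision time under different interference conditions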
Fig.8 UAV path under multiple disturbance conditions