Safe hierarchical reinforcement learning framework for dynamic UAV navigation

doi:10.3785/j.issn.1008-973X.2026.06.011

Journal of ZheJiang University (Engineering Science)

2026, Vol. 60, Issue (6): 1240-1250

Yiming SHANG, Changping DU*, Rui YANG, Tianrui FANG, Ze’an DU, Yao ZHENG

School of Aeronautics and Astronautics, Zhejiang University, Hangzhou 310027, China


Abstract

The safe hierarchical intelligent exploration learning (SHIELD) framework was proposed to address UAV navigation and obstacle avoidance in complex dynamic environments. The framework comprised a four-layer progressive safety-assurance architecture. 1) The reinforcement learning decision-making layer was responsible for global path planning. 2) The expert guidance layer optimized the local path via an improved dynamic window approach. 3) The safety assurance layer combined the artificial potential field method with a control barrier function to provide emergency safety constraints. 4) The primal-dual optimization layer optimized the long-term policy through a flexible optimization mechanism. A dynamic adaptive reward function was designed, in which the reward weights were adjusted according to environmental complexity and task progress. Results showed that SHIELD achieved a task success rate of 95.7% and a path efficiency of 0.962 in complex dynamic environments, representing improvements of 48.8% and 30.2%, respectively, over the reinforcement learning baseline algorithm, and average improvements of 55.0% and 36.0% over three traditional comparison algorithms. The safety and efficiency of UAV navigation in dynamic environments were effectively enhanced.
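Two of the mechanisms named in the abstract, the control barrier function in the safety assurance layer and the multiplier update in the primal-dual optimization layer, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a planar single-integrator UAV model (velocity is the control input), a single circular obstacle, the barrier h(p) = ||p - c||^2 - r^2, and the standard projected dual-ascent rule for a constrained policy objective. The names `cbf_filter`, `dual_update`, `alpha`, `cost_limit` and `step_size` are hypothetical and do not come from the paper.

```python
from typing import Tuple

Vec = Tuple[float, float]


def cbf_filter(p: Vec, u_nom: Vec, center: Vec, radius: float,
               alpha: float = 1.0) -> Vec:
    """Minimally modify a nominal velocity so a barrier condition holds.

    Assumed model: p_dot = u (single integrator), one circular obstacle.
    Barrier: h(p) = ||p - center||^2 - radius^2, safe when h >= 0.
    Condition enforced: dh/dt + alpha * h >= 0, i.e. a . u + alpha * h >= 0
    with a = 2 * (p - center).
    """
    dx, dy = p[0] - center[0], p[1] - center[1]
    h = dx * dx + dy * dy - radius * radius
    ax, ay = 2.0 * dx, 2.0 * dy
    # Value of the affine constraint at the nominal command.
    slack = ax * u_nom[0] + ay * u_nom[1] + alpha * h
    norm_sq = ax * ax + ay * ay
    # Nominal command already satisfies the condition (or the gradient
    # vanishes at the obstacle centre, where no projection is defined).
    if slack >= 0.0 or norm_sq == 0.0:
        return u_nom
    # Closed-form solution of the min-norm QP with one affine constraint:
    # project u_nom onto the half-space boundary a . u + alpha * h = 0.
    scale = slack / norm_sq
    return (u_nom[0] - scale * ax, u_nom[1] - scale * ay)


def dual_update(lam: float, episode_cost: float, cost_limit: float,
                step_size: float) -> float:
    """Projected dual ascent on a Lagrange multiplier.

    The multiplier grows when the measured safety cost exceeds the limit
    and shrinks otherwise, and is clipped so it never becomes negative.
    """
    return max(0.0, lam + step_size * (episode_cost - cost_limit))
```

In such a scheme the action proposed by the learned policy would be passed through `cbf_filter` before execution, while `dual_update` would be called once per training iteration to rescale the safety penalty in the policy objective. How SHIELD actually couples its layers is specified in the full paper, not in this sketch.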

Key words: unmanned aerial vehicle (UAV), reinforcement learning, dynamic obstacle avoidance planning, control barrier function, primal-dual optimization, dynamic window approach

Received: 24 August 2025 Published: 06 May 2026

CLC: V 279; TP 18

Corresponding Authors: Changping DU. E-mail: 22424059@zju.edu.cn; duchangping@zju.edu.cn


Cite this article:

Yiming SHANG, Changping DU, Rui YANG, Tianrui FANG, Ze’an DU, Yao ZHENG. Safe hierarchical reinforcement learning framework for dynamic UAV navigation. Journal of ZheJiang University (Engineering Science), 2026, 60(6): 1240-1250.

URL:

https://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2026.06.011 OR https://www.zjujournals.com/eng/Y2026/V60/I6/1240


Fig.1 Environmental model for UAV path planning

Tab.1 Threat and obstacle density weights at different levels

Tab.2 Weight distribution under adaptive reward shaping

Fig.2 Overall architecture of SHIELD

Tab.3 Main hyperparameters of SAC algorithm

Fig.3 Execution process of PDO algorithm

Tab.4 Main control and environmental parameters in simulation experiment

Fig.4 Reward function curve

Fig.5 Cumulative success rate curve

Fig.6 Path of UAV in complete framework experiment

Tab.5 Comparison of success rate and path efficiency between proposed algorithm and traditional algorithms

Fig.7 UAV path comparison of three traditional algorithms

Tab.6 Comparison of success rate, path efficiency, average turning angle, and average decision time in ablation study

Tab.7 Comparison of success rate and path efficiency under different obstacle configurations

Tab.8 Comparison of success rate, path efficiency and average decision time under different interference

Fig.8 UAV path under multiple disturbance conditions


