Journal of Zhejiang University (Engineering Science)  2023, Vol. 57, Issue (8): 1479-1486    DOI: 10.3785/j.issn.1008-973X.2023.08.001
Computer Technology
Multi-agent pursuit and evasion games based on improved reinforcement learning
Ya-li XUE, Jin-ze YE, Han-yan LI
College of Automation Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
Abstract:

A multi-agent reinforcement learning algorithm based on prioritized experience replay and a decomposed reward function was proposed for multi-agent pursuit and evasion games. Firstly, the multi-agent twin delayed deep deterministic policy gradient (MATD3) algorithm was proposed by combining the multi-agent deep deterministic policy gradient (MADDPG) algorithm with the twin delayed deep deterministic policy gradient (TD3) algorithm. Secondly, prioritized experience replay was used to determine the priority of experiences and to sample high-value experiences, addressing the problem that rewards are largely sparse in the multi-agent pursuit and evasion problem. In addition, a decomposed reward function was designed to divide the multi-agent reward into individual rewards and joint rewards so as to maximize both the global and the local rewards, yielding the DEPER-MATD3 algorithm. Finally, simulation experiments were designed based on DEPER-MATD3 and compared with other algorithms. The results showed that the proposed algorithm effectively solved the overestimation problem and consumed less time than the MATD3 algorithm. In the decomposed reward function environment, the global mean reward of the pursuers increased, and the pursuers had a greater probability of catching the evader.

Key words: pursuit-evasion games; reinforcement learning; experience replay; multi-agent; reward function
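The two mechanisms summarized above, a TD3-style clipped double-Q target for each agent's critics and an additive split of every pursuer's reward into an individual term and a joint term, can be illustrated with a minimal sketch. This is an assumption-laden illustration rather than the authors' implementation: the paper's exact reward terms and network details are not reproduced, and names such as decomposed_reward, individual, and joint are chosen only for the example.

```python
import numpy as np

def decomposed_reward(dist_to_evader, caught, team_caught, capture_bonus=10.0):
    """Illustrative additive reward split (not the paper's exact terms):
    an individual term that shapes each pursuer towards the evader, plus a
    joint term shared by the whole team once any pursuer captures the evader."""
    individual = -dist_to_evader + (capture_bonus if caught else 0.0)
    joint = capture_bonus if team_caught else 0.0
    return individual, joint

def clipped_double_q_target(reward, next_q1, next_q2, done, gamma=0.95):
    """TD3-style target used per agent in MATD3: taking the minimum of the two
    target critics counteracts overestimation of the action value."""
    return reward + gamma * (1.0 - done) * np.minimum(next_q1, next_q2)

# Toy usage: one pursuer 0.8 away from the evader, no capture yet.
ind, joint = decomposed_reward(dist_to_evader=0.8, caught=False, team_caught=False)
y = clipped_double_q_target(reward=ind + joint, next_q1=1.2, next_q2=0.9, done=0.0)
print(ind + joint, y)  # -0.8 and -0.8 + 0.95 * 0.9 = 0.055
```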
Received: 2022-11-23    Published: 2023-08-31
CLC:  TP 242.6  
Funding: National Natural Science Foundation of China (62073164)
About the author: XUE Ya-li (1974—), female, associate professor, engaged in research on adaptive flight control, multi-agent cooperative control, and target recognition. orcid.org/0000-0002-6514-369X. E-mail: xueyali@nuaa.edu.cn

Cite this article:


Ya-li XUE, Jin-ze YE, Han-yan LI. Multi-agent pursuit and evasion games based on improved reinforcement learning. Journal of Zhejiang University (Engineering Science), 2023, 57(8): 1479-1486.

Link to this article:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2023.08.001        https://www.zjujournals.com/eng/CN/Y2023/V57/I8/1479

Fig. 1  Schematic diagram of the multi-agent pursuit-evasion problem
Fig. 2  Centralized training and decentralized execution framework
Fig. 3  Multi-agent pursuit-evasion simulation environment
Training hyperparameter  Symbol  Value
Discount factor  γ  0.95
Target network soft update rate  τ  0.01
Replay buffer size  ReplayBuffer  1×10⁶
Batch size  BatchSize  1024
Number of training episodes  Ep  60000
Time steps per episode  Maxstep  30
Neural network learning rate  ρ  0.01
Network update frequency  UpdateFre  100
Exploration rate  ε  0.5
Priority exponent  α  0.6
Importance-sampling parameter  β  0.5
Table 1  Training hyperparameter settings
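The priority exponent α = 0.6 and importance-sampling parameter β = 0.5 in Table 1 match the standard proportional prioritized-replay scheme. Assuming the paper follows that scheme, a replay buffer of this kind might sample and weight a batch as sketched below; the class and method names are illustrative and not taken from the paper.

```python
import numpy as np

class PrioritizedReplay:
    """Minimal proportional prioritized replay sketch. Priorities raised to the
    power alpha give sampling probabilities; importance-sampling weights with
    exponent beta correct the induced bias."""

    def __init__(self, capacity=int(1e6), alpha=0.6, beta=0.5, eps=1e-6):
        self.capacity, self.alpha, self.beta, self.eps = capacity, alpha, beta, eps
        self.data, self.priorities = [], []

    def add(self, transition):
        # New transitions get the current maximum priority so they are replayed at least once.
        max_p = max(self.priorities, default=1.0)
        if len(self.data) >= self.capacity:
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append(max_p)

    def sample(self, batch_size=1024):
        p = np.asarray(self.priorities) ** self.alpha
        probs = p / p.sum()
        idx = np.random.choice(len(self.data), size=batch_size, p=probs)
        # Importance-sampling weights, normalized by their maximum for stability.
        weights = (len(self.data) * probs[idx]) ** (-self.beta)
        weights /= weights.max()
        return idx, [self.data[i] for i in idx], weights

    def update_priorities(self, idx, td_errors):
        for i, e in zip(idx, td_errors):
            self.priorities[i] = abs(e) + self.eps
```

In a training loop, the absolute TD errors produced by the twin critics would be fed back through update_priorities, so that large-error (high-value) transitions are replayed more often, which is the sampling behaviour described in the abstract.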
Fig. 4  Mean reward curves of the three reinforcement learning algorithms during training
Fig. 5  Mean reward curves under decomposed and non-decomposed reward functions
Fig. 6  Numbers of successful pursuits for the three reinforcement learning algorithms
Fig. 7  Agent trajectories in a single pursuit-evasion trial