Journal of Zhejiang University (Engineering Science), 2026, Vol. 60, Issue 7: 1369-1380. DOI: 10.3785/j.issn.1008-973X.2026.07.001
Computer and Control Engineering
Decision-making for multi-USV adversarial encirclement based on adaptive curriculum reinforcement learning
Lang CHEN (陈浪), Zengli LIU (刘增力), Xuanzhi ZHAO* (赵宣植)
Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China
Abstract:

An encirclement decision-making method based on adaptive curriculum learning and the multi-agent proximal policy optimization (MAPPO) algorithm was proposed to address encirclement and adversarial game problems for unmanned surface vehicle (USV) swarms in complex marine environments. For maritime encirclement and defense tasks, a combat simulation environment containing dynamic targets and multiple reefs was constructed, and the criteria for successful encirclement and for task termination were defined. To improve decision-making efficiency in complex adversarial environments, a normalized state space representing the motion states of both the friendly and the hostile sides, a multi-scale reward function, and a continuous action space were designed. An adaptive curriculum scheduler was introduced into the centralized training and decentralized execution framework to dynamically adjust the complexity of the training environment and the intensity of the action-exploration noise. The results of multiple sets of adversarial simulation experiments show that, compared with baseline methods (e.g., training without curriculum learning and with conventional curriculum learning), the proposed method increases the encirclement success rate, shortens the average task completion time and the average encirclement path length, reduces the number of collisions, and exhibits good generalization ability and adversarial adaptability.

Key words: unmanned surface vehicle; encirclement and defense; adaptive curriculum learning; multi-agent reinforcement learning; multi-agent proximal policy optimization; MAPPO
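The abstract mentions a normalized state space built from the motion states of both sides, but this page does not give the feature list or the scaling constants. The sketch below shows one common way to normalize a relative observation of the target from a single friendly USV. The feature set (range, bearing, own speed, target speed), the function names and the constants `max_range` and `v_max` are hypothetical assumptions for illustration, not the authors' state definition.

```python
import math
from typing import List, Tuple


def wrap_angle(angle: float) -> float:
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


def normalized_relative_obs(own_xy: Tuple[float, float], own_heading: float, own_speed: float,
                            tgt_xy: Tuple[float, float], tgt_speed: float,
                            max_range: float, v_max: float) -> List[float]:
    """Hypothetical 4-feature observation of the target from one friendly USV.

    - range to the target, clipped to max_range and scaled to [0, 1]
    - bearing of the target relative to own heading, scaled to [-1, 1)
    - own speed and target speed divided by v_max, clipped to [0, 1]
    """
    dx = tgt_xy[0] - own_xy[0]
    dy = tgt_xy[1] - own_xy[1]
    rng = math.hypot(dx, dy)
    bearing = wrap_angle(math.atan2(dy, dx) - own_heading)

    def unit(x: float) -> float:
        # Clip a ratio into [0, 1].
        return min(max(x, 0.0), 1.0)

    return [unit(rng / max_range), bearing / math.pi,
            unit(own_speed / v_max), unit(tgt_speed / v_max)]


if __name__ == "__main__":
    # Target 50 m away at bearing atan2(40, 30) from a USV heading along +x.
    print(normalized_relative_obs((0.0, 0.0), 0.0, 2.0, (30.0, 40.0), 5.0,
                                  max_range=100.0, v_max=10.0))
```

Bounding every feature keeps the policy and critic inputs on comparable scales; because of the clipping, any target beyond `max_range` maps to a range feature of 1.0.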
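Received: 2025-06-23; Published: 2026-05-23
CLC: TP 249
Funding: Supported by the Hanjiang National Laboratory Project (KF2024025) and the Foundation of the National Defense Science and Technology Key Laboratory (2023JCJQLB3301).
Corresponding author: Xuanzhi ZHAO (赵宣植). E-mail: 2594489733@qq.com; zhaoxuanzhi@kust.edu.cn
About the author: Lang CHEN (陈浪, 1998-), male, master's student, research interests: multi-agent reinforcement learning and intelligent control of USVs. ORCID: orcid.org/0009-0003-9574-1898. E-mail: 2594489733@qq.com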

Cite this article:

Lang CHEN, Zengli LIU, Xuanzhi ZHAO. Decision-making for multi-USV adversarial encirclement based on adaptive curriculum reinforcement learning. Journal of Zhejiang University (Engineering Science), 2026, 60(7): 1369-1380.

Link to this article:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2026.07.001        https://www.zjujournals.com/eng/CN/Y2026/V60/I7/1369

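Fig. 1  USV motion model
Fig. 2  Schematic of the constraint conditions for successful encirclement
Fig. 3  Observation model of the friendly USVs
Fig. 4  Observation model of the target USV
Fig. 5  Training architecture of the adaptive curriculum learning-multi-agent proximal policy optimization (ACL-MAPPO) algorithm
Fig. 6  Two-dimensional combat simulation scenario at the highest difficulty level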
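| Parameter | Value |
| --- | --- |
| $ {E}_{\text{p},\max } $ | 10 |
| $ {\rho }_{{{g}_{t}}} $ | 0.75, 0.65, 0.55, 0.45, 0.40 |
| $ \lambda $ | $ 0.6 $ |
| $ {E}_{{{g}_{t}},\min } $ | 20, 35, 60, 70, 80 |
| $ {W}_{\text{base}} $ | 60, 70, 80, 90, 100 |

Table 1  Parameter configuration of the adaptive curriculum learning scheduler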
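Table 1 lists the scheduler's per-stage parameters, but their definitions are not given on this page. The sketch below shows one plausible threshold-based curriculum scheduler under an assumed reading of the symbols: $ {\rho }_{{{g}_{t}}} $ as the promotion threshold on a smoothed success rate, $ {E}_{{{g}_{t}},\min } $ as the minimum number of episodes spent in a stage, $ {W}_{\text{base}} $ as the length of the success-rate window, and $ \lambda $ as an exponential-smoothing weight. The class name, the noise schedule (`sigma0`, `sigma_decay`) and every role assigned here are hypothetical, and $ {E}_{\text{p},\max } $ is left unmodeled.

```python
from collections import deque

# Per-stage values from Table 1 (five stages). The roles assigned here are ASSUMPTIONS.
RHO = (0.75, 0.65, 0.55, 0.45, 0.40)  # assumed: smoothed success rate needed to leave stage g
E_MIN = (20, 35, 60, 70, 80)          # assumed: minimum episodes in stage g before promotion
W_BASE = (60, 70, 80, 90, 100)        # assumed: success-rate window length in stage g
LAMBDA = 0.6                          # assumed: EMA weight on the newest windowed success rate
# E_p,max = 10 also appears in Table 1; its role is not modeled in this sketch.


class AdaptiveCurriculumScheduler:
    """Hypothetical threshold-based curriculum scheduler (illustration only)."""

    def __init__(self, sigma0: float = 0.5, sigma_decay: float = 0.8):
        self.stage = 0
        self.sigma0 = sigma0            # hypothetical initial exploration-noise std
        self.sigma_decay = sigma_decay  # hypothetical per-stage noise decay factor
        self._reset_stage_stats()

    def _reset_stage_stats(self) -> None:
        # Fresh statistics (and a stage-specific window length) for the current stage.
        self.episodes_in_stage = 0
        self.smoothed_rate = 0.0
        self.window = deque(maxlen=W_BASE[self.stage])

    @property
    def noise_std(self) -> float:
        # Exploration noise shrinks geometrically as the curriculum advances.
        return self.sigma0 * self.sigma_decay ** self.stage

    def record_episode(self, success: bool) -> bool:
        """Log one episode outcome; return True if the scheduler moved to the next stage."""
        self.episodes_in_stage += 1
        self.window.append(1.0 if success else 0.0)
        window_rate = sum(self.window) / len(self.window)
        self.smoothed_rate = LAMBDA * window_rate + (1.0 - LAMBDA) * self.smoothed_rate
        is_last = self.stage == len(RHO) - 1
        if (not is_last
                and self.episodes_in_stage >= E_MIN[self.stage]
                and self.smoothed_rate >= RHO[self.stage]):
            self.stage += 1
            self._reset_stage_stats()
            return True
        return False


if __name__ == "__main__":
    sched = AdaptiveCurriculumScheduler()
    for _ in range(30):
        sched.record_episode(True)
    print(sched.stage, sched.noise_std)
```

In a training loop, `record_episode` would be called after each episode, and `noise_std` could scale the exploration noise applied to the continuous actions. The smoothing guards against promoting on a short lucky streak, while the per-stage minimum guards against promoting before the window holds enough episodes.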
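| Scenario layout | $ v_{\text{max}}^{\text{T}} $/($ \mathrm{m}\cdot {\mathrm{s}}^{-1} $) | $ {d}_{\text{fin}} $/m | $ {h}_{x} $/m |
| --- | --- | --- | --- |
|  | 3 | 70 | 225 |
| W | 5 | 60 | 275 |
| $ {H}_{\text{1}}\text{,}W $ | 7 | 55 | 315 |
| $ {H}_{1}\text{,}{H}_{2}\text{,}W $ | 9 | 50 | 365 |
| $ {H}_{1}\text{,}{H}_{2}\text{,}{H}_{3}\text{,}W $ | 10 | 45 | 400 |

Table 2  Parameter configuration of the adaptive curriculum learning environment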
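Each row of Table 2 pairs a curriculum stage with a scenario layout and three parameters that tighten as the stage number rises. A minimal sketch of storing these rows as immutable stage configurations follows. The class and function names are hypothetical, the first stage is encoded with an empty layout because its layout cell is blank, and the fields keep the table's symbols because their precise definitions are not given on this page.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class StageConfig:
    layout: Tuple[str, ...]  # layout elements listed for the stage in Table 2
    v_max_target: float      # v_max^T, maximum target speed in m/s
    d_fin: float             # d_fin in m (definition not given on this page)
    h_x: float               # h_x in m (definition not given on this page)


# Rows of Table 2, easiest stage first.
STAGES = (
    StageConfig((), 3.0, 70.0, 225.0),
    StageConfig(("W",), 5.0, 60.0, 275.0),
    StageConfig(("H1", "W"), 7.0, 55.0, 315.0),
    StageConfig(("H1", "H2", "W"), 9.0, 50.0, 365.0),
    StageConfig(("H1", "H2", "H3", "W"), 10.0, 45.0, 400.0),
)


def stage_config(stage: int) -> StageConfig:
    """Clamp a stage index to the defined range and return that stage's configuration."""
    return STAGES[max(0, min(stage, len(STAGES) - 1))]


if __name__ == "__main__":
    for g in range(len(STAGES)):
        print(g, stage_config(g))
```

Clamping lets a scheduler's integer stage index be passed in directly; once the index reaches the last stage, the hardest layout simply keeps being returned.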
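| Parameter | Value | Parameter | Value |
| --- | --- | --- | --- |
| $ {k}_{\text{d1}} $, $ {k}_{\text{d2}} $ | 0.01, 0.15 | $ {K}_{\text{aph}} $ | 300 |
| $ {k}_{\text{a1}} $, $ {k}_{\text{a2}} $ | 0.5, 0.1 | $ {K}_{\text{eva}} $ | 0.8 |
| $ {k}_{\text{v}} $ | 0.5 | $ {w}_{1},{w}_{2},{w}_{3},{w}_{4} $ | 0.35, 0.17, 0.28, 0.20 |

Table 3  Reward function parameter settings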
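Table 3 lists four weights $ {w}_{1} $ to $ {w}_{4} $ for the multi-scale reward. Assuming they combine four per-step reward terms linearly (the terms are not defined on this page, and the function name is hypothetical), the sketch below computes the weighted sum. Because the weights are non-negative and sum to 0.35 + 0.17 + 0.28 + 0.20 = 1.00, the result is a convex combination and always lies between the smallest and the largest term.

```python
from typing import Sequence

# w1..w4 from Table 3; the linear combination below is an ASSUMPTION.
REWARD_WEIGHTS = (0.35, 0.17, 0.28, 0.20)


def combined_reward(terms: Sequence[float],
                    weights: Sequence[float] = REWARD_WEIGHTS) -> float:
    """Hypothetical weighted sum of four per-step reward terms."""
    if len(terms) != len(weights):
        raise ValueError(f"expected {len(weights)} reward terms, got {len(terms)}")
    return sum(w * r for w, r in zip(weights, terms))


if __name__ == "__main__":
    # Illustrative term values only.
    print(combined_reward([0.2, -0.5, 0.9, 0.1]))
```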
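| Parameter | MAPPO | PPO |
| --- | --- | --- |
| M | 2 048 | 1 024 |
| Bs | 256 | 128 |
| $ {l}_{{\mathrm{r}}} $ | $ 2\times {10}^{-4} $ | $ {10}^{-4} $ |
| $ \gamma $ | 0.99 | 0.99 |
| $ \epsilon $ | 0.2 | 0.2 |
| $ {\lambda }_{\text{GAE}} $ | 0.95 | 0.95 |
| $ {\alpha }_{\text{H}} $ | 0.01 | 0.01 |
| $ {T}_{\max } $ | 300 | 300 |
| $ {E}_{\text{epoch}} $ | 10 | 3 |
| $ {\tau }_{\text{clip}} $ | 0.5 |  |
| $ {n}_{\text{h}} $ | 2×256 | 2×128 |

Table 4  Hyperparameter settings of the MAPPO and PPO algorithms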
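Reading $ \gamma $ as the discount factor, $ {\lambda }_{\text{GAE}} $ as the smoothing parameter of generalized advantage estimation (GAE), and $ \epsilon $ = 0.2 as PPO's probability-ratio clipping parameter (the standard meanings of these symbols), the sketch below implements textbook GAE and the per-sample PPO clipped surrogate in plain Python with the Table 4 values as defaults. It is a generic version for illustration, not the authors' code, and it omits the value loss, the entropy bonus weighted by $ {\alpha }_{\text{H}} $, and minibatching.

```python
from typing import List, Sequence, Tuple


def compute_gae(rewards: Sequence[float], values: Sequence[float], dones: Sequence[bool],
                last_value: float, gamma: float = 0.99,
                lam: float = 0.95) -> Tuple[List[float], List[float]]:
    """Generalized advantage estimation over one rollout of length T.

    values[t] is V(s_t); last_value bootstraps V(s_T). dones[t] is True when the
    episode ended at step t, which stops bootstrapping and resets the recursion.
    Returns (advantages, returns) with returns[t] = advantages[t] + values[t].
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        delta = rewards[t] + gamma * next_value * not_done - values[t]  # TD error
        gae = delta + gamma * lam * not_done * gae                       # discounted sum of TD errors
        advantages[t] = gae
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO clipped objective for one sample (to be maximized)."""
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    adv, ret = compute_gae([1.0, 0.0, 1.0], [0.5, 0.4, 0.6], [False, False, True], 0.0)
    print(adv, ret)
    print(clipped_surrogate(1.5, 2.0))
```

Taking the minimum of the unclipped and clipped terms means the objective gains nothing from pushing the probability ratio beyond 1 ± ε in the direction the advantage favors, which is what keeps each policy update conservative.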
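Fig. 7  Reward curves of different USVs
Fig. 8  Performance comparison of different algorithms
Fig. 9  Performance comparison of the algorithms under two target motion modes
Fig. 10  Velocity and acceleration curves of the target USV
Fig. 11  Parameter curves of the friendly USVs during encirclement
Fig. 12  Trajectories of a successful target encirclement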
Target strategy  Algorithm  $ {C}_{\text{u}} $  $ {C}_{\text{T}} $  $ {P}_{\text{u}} $  $ {P}_{\text{T}} $  $ {D}_{\text{avg}} $/m  $ {R}_{\text{avg}} $
RLACL-MAPPO19440462302.32455.253
CL-MAPPO364587122367.47442.437
NOCL-MAPPO1 40113443238421.16117.367
ACL-MADDPG389678331381.19340.152
RandomACL-MAPPO94230115454285.39137.113
CL-MAPPO1 22430420552321.48230.172
NOCL-MAPPO1 55815036756343.51413.627
ACL-MADDPG1 28731722161337.64325.241
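Table 5  Performance comparison of the algorithms under two target strategies
Fig. 13  Simulated encirclement trajectories of different algorithms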