Journal of Zhejiang University (Engineering Science)  2024, Vol. 58, Issue (3): 480-491    DOI: 10.3785/j.issn.1008-973X.2024.03.005
Computer Technology
Policy gradient algorithm and its convergence analysis for two-player zero-sum Markov games
Zhuo WANG, Yongqiang LI*, Yu FENG, Yuanjing FENG
1. School of Information Engineering, Zhejiang University of Technology, Hangzhou 310000, China
Full text: PDF (1535 KB)   HTML
Abstract:

An approximate Nash equilibrium policy optimization algorithm that updates the policies of both players simultaneously was proposed to address the low learning efficiency of policy-based reinforcement learning methods in two-player zero-sum Markov games. The two-player zero-sum Markov game was formulated as a max-min optimization problem. For parameterized policies, the policy gradient theorem of the Markov game was given, and the derivation of an approximate stochastic policy gradient provided a practical basis for implementing the algorithm. Different gradient update methods for the max-min problem were compared and analyzed, and the extragradient method was found to have better convergence performance than the alternatives. Based on this finding, an extragradient-based approximate Nash equilibrium policy optimization algorithm was proposed, and a convergence proof of the algorithm was given. The effectiveness of the algorithm at different game scales was verified on the Oshi-Zumo game, using both a tabular softmax parameterization and a neural network as the parameterized policy. Comparative experiments verified the convergence of the algorithm and its advantages over other methods.
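To make the extragradient update described above concrete, the following is a minimal sketch, not the paper's implementation: it assumes hypothetical estimators grad_theta(theta, phi) and grad_phi(theta, phi) that return stochastic policy-gradient estimates of the shared game value for the max player (theta) and the min player (phi), with step size alpha.

```python
def extragradient_step(theta, phi, grad_theta, grad_phi, alpha):
    # Extrapolation (look-ahead) step from the current strategy pair.
    theta_mid = theta + alpha * grad_theta(theta, phi)   # max player ascends
    phi_mid = phi - alpha * grad_phi(theta, phi)         # min player descends
    # Update the original point using gradients evaluated at the midpoint;
    # this look-ahead is what distinguishes extragradient from plain
    # simultaneous gradient descent-ascent.
    theta_new = theta + alpha * grad_theta(theta_mid, phi_mid)
    phi_new = phi - alpha * grad_phi(theta_mid, phi_mid)
    return theta_new, phi_new
```

Both players' parameters are updated at the same time, matching the simultaneous-update scheme stated in the abstract.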

Key words: two-player zero-sum Markov game; reinforcement learning; policy optimization; extragradient; Nash equilibrium; neural network
Received: 2023-04-13    Published: 2024-03-05
CLC:  TP 18
Funding: National Natural Science Foundation of China (62073294); Natural Science Foundation of Zhejiang Province (LZ21F030003).
Corresponding author: Yongqiang LI    E-mail: yqli@zjut.edu.cn
About the first author: Zhuo WANG (1998—), male, master's student, researching multi-agent reinforcement learning and game theory. orcid.org/0009-0007-3778-8523. E-mail: 2112103098@zjut.edu.cn

Cite this article:

Zhuo WANG, Yongqiang LI, Yu FENG, Yuanjing FENG. Policy gradient algorithm and its convergence analysis for two-player zero-sum Markov games[J]. Journal of Zhejiang University (Engineering Science), 2024, 58(3): 480-491.

Link to this article:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2024.03.005        https://www.zjujournals.com/eng/CN/Y2024/V58/I3/480

Fig. 1  Exploration trajectories of different update rules
Fig. 2  Schematic of state changes in Oshi-Zumo under different settings
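The contrast between update rules shown in Fig. 1 can be reproduced qualitatively on a toy problem. The sketch below is an illustration, not the paper's experiment: on the bilinear saddle problem min over x, max over y of x·y, simultaneous gradient descent-ascent spirals away from the saddle point (0, 0), while the extragradient rule converges toward it.

```python
def gda(x, y, lr):
    # Simultaneous gradient descent-ascent on f(x, y) = x*y: grad_x = y, grad_y = x.
    return x - lr * y, y + lr * x

def extragradient(x, y, lr):
    xm, ym = x - lr * y, y + lr * x        # extrapolation step
    return x - lr * ym, y + lr * xm        # update with midpoint gradients

x1 = y1 = x2 = y2 = 1.0
for _ in range(200):
    x1, y1 = gda(x1, y1, 0.1)
    x2, y2 = extragradient(x2, y2, 0.1)

print(abs(x1), abs(y1))   # grows: descent-ascent drifts away from (0, 0)
print(abs(x2), abs(y2))   # shrinks: extragradient approaches the saddle point
```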
Parameterization     λ      α_k    γ      k_max     n
Tabular softmax      0.3    0.5    1.0    10^6      1
Neural network       0.1    0.3    1.0    8×10^6    1
Table 1  Algorithm hyperparameters under the different policy parameterization settings
Fig. 3  Nash convergence metric of the MG-PG algorithm with the tabular softmax policy parameterization
Fig. 4  Nash convergence metric of the MG-PG algorithm with the neural network policy parameterization
Fig. 5  Nash convergence metrics of the four compared algorithms
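The Nash convergence metric plotted in Figs. 3-5 measures how far a strategy pair is from a Nash equilibrium. As a hedged, self-contained illustration of such an exploitability-style metric, the sketch below computes it for a zero-sum matrix game rather than the paper's Markov game; the exact metric used in the paper may differ. Here A is the row (max) player's payoff matrix, x and y are the players' mixed strategies, and the metric is zero exactly at a Nash equilibrium.

```python
import numpy as np

def nash_conv(A, x, y):
    value = x @ A @ y
    best_row = np.max(A @ y)   # value of the max player's best response to y
    best_col = np.min(x @ A)   # value of the min player's best response to x
    # Sum of each player's possible improvement; zero at a Nash equilibrium.
    return (best_row - value) + (value - best_col)

# Matching pennies: the uniform strategy pair is the unique equilibrium.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
print(nash_conv(A, np.array([0.5, 0.5]), np.array([0.5, 0.5])))  # 0.0
print(nash_conv(A, np.array([1.0, 0.0]), np.array([0.5, 0.5])))  # 1.0 (exploitable)
```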