Communication Engineering, Automation Technology
|
|
|
|
Natural Actor-Critic based on batch recursive least-squares
WANG Guo-fang, FANG Zhou, LI Ping
School of Aeronautics and Astronautics, Zhejiang University, Hangzhou 310027, China
|