JOURNAL OF ZHEJIANG UNIVERSITY (ENGINEERING SCIENCE)
    
Natural Actor-Critic based on batch recursive least-squares
WANG Guo-fang, FANG Zhou, LI Ping
School of Aeronautics and Astronautics, Zhejiang University, Hangzhou 310027, China

Abstract  

The natural actor-critic algorithm based on batch recursive least-squares (NAC-BRLS) was proposed in order to reduce the agent's online computational burden and improve real-time performance. The algorithm employs batch recursive least-squares in the Critic to estimate the natural gradient, and performs optimistic updates in the Actor using the estimated gradient. Batch recursive least-squares enables the agent to adjust the data size of every batch according to its own computational capability, trading off between fully optimistic and partially optimistic updates and thus improving the flexibility of NAC-LSTD. Simulation results on the mountain-car task show that, compared with NAC-LSTD, NAC-BRLS greatly reduces the computational complexity without noticeably affecting the convergence properties.
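The batch-recursive-least-squares idea in the Critic can be illustrated with a minimal sketch. The class below is an illustrative RLS estimator, not the authors' implementation: the name `BatchRLS`, the regularization parameter `delta`, and the plain linear-regression targets are assumptions for the example; in NAC-BRLS the regression would be the Critic's estimation of the natural gradient.

```python
import numpy as np

class BatchRLS:
    """Illustrative recursive least-squares estimator fed in batches.

    Maintains P ~= (X^T X + delta*I)^(-1) via rank-one Sherman-Morrison
    updates, so each new sample costs O(d^2) instead of re-solving the
    full least-squares problem. The caller chooses the batch size, which
    is the knob NAC-BRLS tunes to trade per-step cost against how
    up to date ("optimistic") the estimate is.
    """

    def __init__(self, dim, delta=1e-3):
        self.P = np.eye(dim) / delta  # inverse-covariance estimate
        self.w = np.zeros(dim)        # current least-squares weights

    def update_batch(self, samples):
        """Absorb an iterable of (features x, target y) pairs."""
        for x, y in samples:
            Px = self.P @ x
            gain = Px / (1.0 + x @ Px)         # RLS gain vector
            self.w += gain * (y - x @ self.w)  # correct by the innovation
            self.P -= np.outer(gain, Px)       # Sherman-Morrison downdate
        return self.w

# Usage: recover a linear map from batched data; the batch size changes
# when the work is done per step, not the final estimate.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

est = BatchRLS(dim=3)
for start in range(0, len(X), 25):            # batches of 25 samples
    est.update_batch(zip(X[start:start + 25], y[start:start + 25]))

print(np.round(est.w, 3))
```

Because the rank-one updates commute with batching, a small batch size gives frequent cheap updates while a large one amortizes bookkeeping over more samples, which is the flexibility the abstract attributes to NAC-BRLS.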



Published: 10 September 2015
CLC:  TP 18  
Cite this article:

WANG Guo-fang, FANG Zhou, LI Ping. Natural Actor-Critic based on batch recursive least-squares. JOURNAL OF ZHEJIANG UNIVERSITY (ENGINEERING SCIENCE), 2015, 49(7): 1335-1342.

URL:

http://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2015.07.019     OR     http://www.zjujournals.com/eng/Y2015/V49/I7/1335


Natural Actor-Critic algorithm based on batch recursive least-squares

To reduce the online computational burden incurred when the agent in an Actor-Critic architecture estimates the natural gradient by least squares, and to improve real-time performance, a new learning algorithm, NAC-BRLS, is proposed. The algorithm uses batch recursive least squares in the Critic to estimate the natural gradient, and optimistically updates the policy according to the estimated gradient. The introduction of batch recursive least squares allows the agent to freely adjust the amount of data processed per batch, i.e. the amount of data used in each policy evaluation, according to its own computational capability, trading off between fully optimistic and partially optimistic updates and greatly improving the flexibility of the NAC-LSTD algorithm. Mountain-car simulation experiments show that, compared with NAC-LSTD, NAC-BRLS markedly reduces the agent's average per-step computational burden while maintaining comparable convergence performance.

[1] SUTTON R S, BARTO A G. Introduction to reinforcement learning [M]. Cambridge: MIT, 1998.
[2] PETERS J, SCHAAL S. Natural actor-critic [J]. Neurocomputing, 2008, 71(7): 1180-1190.
[3] GRONDMAN I, BUSONIU L, LOPES G A D, et al. A survey of actor-critic reinforcement learning: standard and natural policy gradients [J]. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 2012, 42(6): 1291-1307.
[4] CHENG Yu-hu, FENG Huan-ting, WANG Xue-song. Expectation-maximization policy search with parameter-based exploration [J]. Acta Automatica Sinica, 2012, 38(1): 38-45.
[5] BHATNAGAR S, SUTTON R S, GHAVAMZADEH M, et al. Natural actor-critic algorithms [J]. Automatica, 2009, 45(11): 2471-2482.
[6] SUTTON R S. Learning to predict by the methods of temporal differences [J]. Machine Learning, 1988, 3(1): 9-44.
[7] ADAM S, BUSONIU L, BABUSKA R. Experience replay for real-time reinforcement learning control [J]. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 2012, 42(2): 201-212.
[8] BRADTKE S J, BARTO A G. Linear least-squares algorithms for temporal difference learning [J]. Machine Learning, 1996, 22(1/2/3): 33-57.
[9] BOYAN J A. Technical update: least-squares temporal difference learning [J]. Machine Learning, 2002, 49(2/3): 233-246.
[10] DANN C, NEUMANN G, PETERS J. Policy evaluation with temporal differences: a survey and comparison [J]. The Journal of Machine Learning Research, 2014, 15(1): 809-883.
[11] GEIST M, PIETQUIN O. Revisiting natural actor-critics with value function approximation [M]∥Modeling Decisions for Artificial Intelligence. Berlin: Springer, 2010: 207-218.
[12] CHENG Yu-hu, FENG Huan-ting, WANG Xue-song. Efficient data use in incremental actor-critic algorithms [J]. Neurocomputing, 2013, 116(10): 346-354.
[13] XU Xin, HE Han-gen, HU De-wen. Efficient reinforcement learning using recursive least-squares methods [J]. Journal of Artificial Intelligence Research, 2002, 16(1): 259-292.
[14] AMARI S I. Natural gradient works efficiently in learning [J]. Neural Computation, 1998, 10(2): 251-276.
[15] BUSONIU L, ERNST D, SCHUTTER B D, et al. Online least-squares policy iteration for reinforcement learning control [C]∥American Control Conference (ACC). Baltimore: IEEE, 2010: 486-491.
[16] HAO Chuan-chuan, FANG Zhou, LI Ping. Efficient reinforcement-learning control algorithm using experience reuse [J]. Journal of South China University of Technology: Natural Science Edition, 2012, 40(6): 70-75.
[17] SUTTON R S, MCALLESTER D A, SINGH S P, et al. Policy gradient methods for reinforcement learning with function approximation [C]∥ Advances in Neural Information Processing Systems(NIPS). Denver: MIT, 1999, 99: 1057-1063.
[18] MOUSTAKIDES G V. Study of the transient phase of the forgetting factor RLS [J]. IEEE Transactions on Signal Processing, 1997, 45(10): 2468-2476.
[19] LJUNG L, SÖDERSTRÖM T. Theory and practice of recursive identification [M]. Cambridge: MIT, 1983.
[20] HACHIYA H, AKIYAMA T, SUGIYAMA M, et al. Adaptive importance sampling for value function approximation in off-policy reinforcement learning [J]. Neural Networks, 2009, 22(10): 1399-1410.
[21] HUANG Zhen-hua, XU Xin, ZUO Lei. Reinforcement learning with automatic basis construction based on isometric feature mapping [J]. Information Sciences, 2014, 286(1): 209-227.

[1] ZHU Dong-yang, SHEN Jing-yi, HUANG Wei-ping, LIANG Jun. Fault classification based on modified active learning and weighted SVM[J]. JOURNAL OF ZHEJIANG UNIVERSITY (ENGINEERING SCIENCE), 2017, 51(4): 697-705.
[2] FENG Xiao-yue, LIANG Yan-chun, LIN Xi-xun, GUAN Ren-chu. Research and development of never-ending language learning[J]. JOURNAL OF ZHEJIANG UNIVERSITY (ENGINEERING SCIENCE), 2017, 51(1): 82-88.
[3] QIU Ri-hui, LIU Kang-ling, TAN Hai-long, LIANG Jun. Classification algorithm based on extreme learning machine and its application in fault identification of Tennessee Eastman process[J]. JOURNAL OF ZHEJIANG UNIVERSITY (ENGINEERING SCIENCE), 2016, 50(10): 1965-1972.
[4] JI Yu, QIU Qing-ying, FENG Pei-en, HUANG Hao. Extraction and utilization of design knowledge in international patent classification[J]. JOURNAL OF ZHEJIANG UNIVERSITY (ENGINEERING SCIENCE), 2016, 50(3): 412-418.
[5] JU Bin, QIAN Yun-tao, YE Min-chao. Collaborative filtering algorithm based on structured projective nonnegative matrix factorization[J]. JOURNAL OF ZHEJIANG UNIVERSITY (ENGINEERING SCIENCE), 2015, 49(7): 1319-1325.
[6] TAN Hai-long, LIU Kang-ling, JIN Xin, SHI Xiang-rong, LIANG Jun. Multivariate time series classification based on μσ-DWC feature and tree-structured M-SVM[J]. JOURNAL OF ZHEJIANG UNIVERSITY (ENGINEERING SCIENCE), 2015, 49(6): 1061-1069.
[7] MIAO Feng, XIE An-huan, WANG Fu-an, YU Feng, ZHOU Hua. Method for multi-stage alternative grouping parallel machines scheduling problem[J]. JOURNAL OF ZHEJIANG UNIVERSITY (ENGINEERING SCIENCE), 2015, 49(5): 866-872.
[8] WANG Li-jun, HUANG Zhong-chao, ZHAO Yu-qian. New spatial-coherent latent topic model based on super-pixel segmentation and scene classification method[J]. JOURNAL OF ZHEJIANG UNIVERSITY (ENGINEERING SCIENCE), 2015, 49(3): 402-408.
[9] WANG Cheng-long, LI Cheng, FENG Yi-ping, RONG Gang. Dispatching rule extraction method for job shop scheduling problem[J]. JOURNAL OF ZHEJIANG UNIVERSITY (ENGINEERING SCIENCE), 2015, 49(3): 421-429.
[10] YU Jun, WANG Zeng-fu. Video stabilization based on empirical mode decomposition and several evaluation criterions[J]. JOURNAL OF ZHEJIANG UNIVERSITY (ENGINEERING SCIENCE), 2014, 48(3): 423-429.
[11] LIU Ye-feng, XU Guan-qun, PAN Quan-ke, CHAI Tian-you. Magnetic material molding sintering production scheduling optimization method and its application[J]. JOURNAL OF ZHEJIANG UNIVERSITY (ENGINEERING SCIENCE), 2013, 47(9): 1517-1523.
[12] LIN Yi-ning, WEI Wei, DAI Yuan-ming. Semi-supervised Hough Forest tracking method[J]. JOURNAL OF ZHEJIANG UNIVERSITY (ENGINEERING SCIENCE), 2013, 47(6): 977-983.
[13] HAO Chuan-chuan, FANG Zhou, LI Ping. Output feedback reinforcement learning control method based on reference model[J]. JOURNAL OF ZHEJIANG UNIVERSITY (ENGINEERING SCIENCE), 2013, 47(3): 409-414.
[14] WANG Peng-jun, WANG Zhen-hai, CHEN Yao-wu, LI Hui. Searching the best polarity for fixed polarity Reed-Muller circuits based on delay model[J]. JOURNAL OF ZHEJIANG UNIVERSITY (ENGINEERING SCIENCE), 2013, 47(2): 361-366.
[15] LI Kan, HUANG Wen-xiong, HUANG Zhong-hua. Multi-sensor detected object classification method based on support vector machine[J]. JOURNAL OF ZHEJIANG UNIVERSITY (ENGINEERING SCIENCE), 2013, 47(1): 15-22.