Journal of Zhejiang University (Engineering Science)
Communication Engineering, Automation Technology
Natural Actor-Critic based on batch recursive least-squares
WANG Guo-fang, FANG Zhou, LI Ping
School of Aeronautics and Astronautics, Zhejiang University, Hangzhou 310027, China
Abstract:

To reduce the online computational burden placed on the agent when the natural gradient is estimated by least squares in the Actor-Critic framework, and to improve real-time performance, a new learning algorithm, NAC-BRLS (natural Actor-Critic based on batch recursive least-squares), is proposed. The algorithm employs batch recursive least squares in the Critic to estimate the natural gradient and optimistically updates the policy in the Actor with the estimated gradient. The batch recursive scheme lets the agent adjust the amount of data processed in each batch, that is, the number of samples used for each policy estimate, according to its own computational capability, trading off between fully optimistic and partially optimistic updates and thereby greatly improving the flexibility of NAC-LSTD. Mountain-car simulation results show that, compared with NAC-LSTD, NAC-BRLS markedly reduces the agent's average per-step computational burden while maintaining comparable convergence performance.
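The Python sketch below is meant only to fix ideas about the two steps described in the abstract: a Critic that refines a least-squares estimate recursively over batches of transitions, and an Actor that takes an optimistic step along the estimated natural gradient after each batch. The class name, the feature helpers (grad_log_pi, phi), the default batch size m, and the omission of any forgetting of the critic statistics after a policy change are assumptions made for illustration; this is not the paper's exact recursion or parameterization.

```python
import numpy as np

class NacBrlsSketch:
    """Illustrative Critic/Actor loop; a sketch, not the paper's exact algorithm."""

    def __init__(self, n_policy, n_state, gamma=0.99, alpha=0.01, m=10, delta=1e-2):
        self.n_w = n_policy                     # length of the natural-gradient part w
        self.n = n_policy + n_state             # combined compatible-feature length
        self.gamma, self.alpha, self.m = gamma, alpha, m
        self.z = np.zeros(self.n)               # critic parameters [w; v]
        self.P = np.eye(self.n) / delta         # recursive inverse-matrix estimate
        self.batch = []                         # transitions waiting in the current batch

    def _features(self, grad_log_pi, phi_s):
        # Compatible feature vector x(s, a) = [grad log pi(a|s; theta); phi(s)].
        return np.concatenate([grad_log_pi, phi_s])

    def _critic_rls_step(self, x, phi_next, r):
        # One recursive least-squares (RLS-TD style) update on the combined features.
        x_next = np.concatenate([np.zeros(self.n_w), phi_next])
        d = x - self.gamma * x_next             # temporal-difference feature
        Px = self.P @ x
        denom = 1.0 + d @ Px
        td_error = r + self.gamma * (self.z @ x_next) - self.z @ x
        self.z += (Px / denom) * td_error
        self.P -= np.outer(Px, d @ self.P) / denom

    def step(self, grad_log_pi, phi_s, phi_next, r, policy_theta):
        # Collect one transition; after m of them, run the batch of critic updates
        # and take one optimistic Actor step along the estimated natural gradient w.
        self.batch.append((self._features(grad_log_pi, phi_s), phi_next, r))
        if len(self.batch) >= self.m:
            for x, pn, rew in self.batch:
                self._critic_rls_step(x, pn, rew)
            w = self.z[:self.n_w]               # estimated natural gradient
            policy_theta = policy_theta + self.alpha * w
            self.batch.clear()
            # A forgetting/reset of P and z after the policy change is omitted here.
        return policy_theta
```

Here m is the knob the abstract refers to: the agent chooses how many samples each batch (and hence each policy estimate) uses according to its computational capability. Presumably a small m yields more frequent, more optimistic policy updates, while a large m lets the Critic absorb more data before each update.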

Published: 2015-09-10
CLC number: TP 18
Funding:

Supported by the National Natural Science Foundation of China (61004066) and the Zhejiang Provincial Natural Science Foundation (LY15F030005).

Corresponding author: FANG Zhou, male, associate professor. E-mail: zfang@zju.edu.cn
About the first author: WANG Guo-fang (1989-), male, Ph.D. candidate, research interests: reinforcement learning and transfer learning. E-mail: gfwang89@zju.edu.cn

Cite this article:

WANG Guo-fang, FANG Zhou, LI Ping. Natural Actor-Critic based on batch recursive least-squares [J]. Journal of Zhejiang University (Engineering Science), 2015, 49(7). DOI: 10.3785/j.issn.1008-973X.2015.07.019.

Link to this article:

http://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2015.07.019        http://www.zjujournals.com/eng/CN/Y2015/V49/I7/1335
