J4  2013, Vol. 47 Issue (3): 409-414    DOI: 10.3785/j.issn.1008-973X.2013.03.003
Computer Technology
Output feedback reinforcement learning control method based on reference model
HAO Chuan-chuan1, FANG Zhou2, LI Ping2
1. Department of Control Science and Engineering, Zhejiang University, Hangzhou 310027, China;
2. School of Aeronautics and Astronautics, Zhejiang University, Hangzhou 310027, China
Full text: PDF  HTML
Abstract:

Most existing direct policy search reinforcement learning (RL) control methods require full measurement of the system state and yield a state-feedback control policy, which restricts them to a narrow class of plants. A model-reference output-feedback RL control method with a wider range of application was therefore proposed. Its learning process depends only on the plant output, and it converges to an output-feedback control policy that gives the closed-loop system the desired dynamic performance. A reward function built on a reference model was constructed to describe the desired closed-loop dynamics effectively. A parameterized stochastic control policy based on a PID output-feedback law was adopted, so that prior knowledge and the manual PID tuning rules commonly used in the control community can be exploited to determine a good initial policy and thereby shorten the learning time. The episodic natural actor-critic (eNAC) algorithm, which has good learning performance, was used to optimize the control policy. Simulation results on a second-order open-loop unstable plant and on a linear parameter-varying (LPV) model of the longitudinal (pitch-channel) dynamics of a high-subsonic unmanned aerial vehicle (UAV) verify the effectiveness of the proposed method.
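The abstract names three concrete ingredients: a reward function shaped by a reference model, a stochastic output-feedback policy with PID structure whose gains are the learned parameters, and policy optimization with eNAC. The Python sketch below illustrates only the first two ingredients under assumed forms (a second-order reference model, a quadratic tracking penalty, Gaussian exploration noise); names such as reference_model_step and StochasticPIDPolicy are hypothetical and are not taken from the paper.

import numpy as np

def reference_model_step(y_ref, r, dt=0.01, wn=2.0, zeta=0.7):
    # One Euler step of an assumed second-order reference model
    #   y'' + 2*zeta*wn*y' + wn^2*y = wn^2*r,
    # which encodes the desired closed-loop dynamics.
    y, yd = y_ref
    ydd = wn ** 2 * (r - y) - 2.0 * zeta * wn * yd
    return np.array([y + dt * yd, yd + dt * ydd])

def reward(y, y_ref_out):
    # Reference-model-based reward: penalizes the gap between the plant
    # output and the reference-model output (assumed quadratic form).
    return -float((y - y_ref_out) ** 2)

class StochasticPIDPolicy:
    # Output-feedback policy u = Kp*e + Ki*integral(e) + Kd*de/dt plus
    # Gaussian exploration noise; theta = (Kp, Ki, Kd) can be initialized
    # with conventional PID tuning rules and refined by policy-gradient RL.
    def __init__(self, theta_init=(1.0, 0.1, 0.05), sigma=0.1, dt=0.01):
        self.theta = np.asarray(theta_init, dtype=float)
        self.sigma = sigma       # exploration standard deviation (assumed fixed)
        self.dt = dt
        self.e_int = 0.0         # integral of the tracking error
        self.e_prev = 0.0        # previous error, for the derivative term

    def act(self, e, rng):
        self.e_int += e * self.dt
        e_dot = (e - self.e_prev) / self.dt
        self.e_prev = e
        features = np.array([e, self.e_int, e_dot])
        u_mean = float(self.theta @ features)
        u = u_mean + self.sigma * rng.standard_normal()
        # Score function (gradient of the log-policy w.r.t. theta), the
        # quantity a natural actor-critic accumulates along an episode.
        score = (u - u_mean) / self.sigma ** 2 * features
        return u, score

In a learner such as eNAC, episodes would be rolled out with this policy (for example, u, score = policy.act(e, np.random.default_rng(0))), the per-step scores and reference-model rewards accumulated, and the gains theta updated along the estimated natural gradient. Only the plant output enters the computation through the tracking error e, so no state measurement is required.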

Published: 2013-03-01
CLC number: TP 18
Foundation items: the National Natural Science Foundation of China (61004066); the Public Welfare Technology Research Project of the Zhejiang Provincial Department of Science and Technology (2011C23106); the Fundamental Research Funds for the Central Universities (2011FZA4031).

Corresponding author: FANG Zhou, male, associate professor. E-mail: zfang@iipc.zju.edu.cn
About the author: HAO Chuan-chuan (b. 1984), male, Ph.D. candidate, engaged in research on UAV navigation and control and reinforcement learning control. E-mail: cchao@iipc.zju.edu.cn

Cite this article:

HAO Chuan-chuan, FANG Zhou, LI Ping. Output feedback reinforcement learning control method based on reference model [J]. J4, 2013, 47(3): 409-414.

Link to this article:

http://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2013.03.003        http://www.zjujournals.com/eng/CN/Y2013/V47/I3/409

[1] TAMEI T, SHIBATA T. Fast reinforcement learning for three-dimensional kinetic human-robot cooperation with EMG-to-activation model [J]. Advanced Robotics, 2011, 25(5): 563-580.
[2] HAN Y K, KIMURA H. Motions obtaining of multi-degree-freedom underwater robot by using reinforcement learning algorithms [C]∥ IEEE Region 10 Annual International Conference, Proceedings/TENCON. New Jersey: IEEE, 2010: 1498-1502.
[3] PETERS J, SCHAAL S. Natural actor-critic [J]. Neurocomputing, 2008, 71(7/8/9): 1180-1190.
[4] ABBEEL P. Apprenticeship learning and reinforcement learning with application to robotic control [D]. Stanford: Department of Computer Science, Stanford University, 2008.
[5] YU Tao, HU Xi-bing, LIU Jing. Multi-objective optimal power flow calculation based on multi-step Q(λ) learning algorithm [J]. Journal of South China University of Technology: Natural Science Edition, 2010, 38(10): 139-145. (in Chinese)
[6] CHU B, PARK J, HONG D. Tunnel ventilation controller design using an RLS-based natural actor-critic algorithm [J]. International Journal of Precision Engineering and Manufacturing, 2010, 11(6): 829-838.
[7] LEWIS F L, VRABIE D. Reinforcement learning and adaptive dynamic programming for feedback control [J]. IEEE Circuits and Systems Magazine, 2009, 9(3): 32-50.
[8] LEWIS F L, VAMVOUDAKIS K G. Optimal adaptive control for unknown systems using output feedback by reinforcement learning methods [C]∥ Proceedings of 2010 8th IEEE International Conference on Control and Automation. New Jersey: IEEE Computer Society, 2010: 2138-2145.
[9] WANG Xue-ning, CHEN Wei, ZHANG Meng, et al. A survey of direct policy search methods in reinforcement learning [J]. CAAI Transactions on Intelligent Systems, 2007, 2(1): 16-24. (in Chinese)
