J4  2013, Vol. 47 Issue (3): 409-414    DOI: 10.3785/j.issn.1008-973X.2013.03.003
    
Output feedback reinforcement learning control method based on reference model
HAO Chuan-chuan1, FANG Zhou2, LI Ping2
1. Department of Control Science and Engineering, Zhejiang University, Hangzhou 310027, China;
2. School of Aeronautics and Astronautics, Zhejiang University, Hangzhou 310027, China

Abstract  

Most existing direct policy search reinforcement learning (RL) control methods require full measurement of the system states and produce a state-feedback control policy, so they are applicable only to a limited class of systems. This work developed a novel RL control method that relies only on the system output and converges to an output-feedback control policy which gives the closed-loop system the desired dynamic properties. An informative reward function based on a reference model was adopted to represent the desired closed-loop dynamics effectively. A stochastic output-feedback control policy based on the PID law was used, so that prior knowledge can be exploited to determine a good initial policy through the systematic manual tuning techniques commonly used in the control community. The episodic Natural Actor-Critic (eNAC) algorithm, which has good learning performance, was used for policy optimization. Simulations on a second-order open-loop unstable system and a linear parameter-varying (LPV) model of an unmanned aerial vehicle's (UAV's) longitudinal dynamics demonstrate the effectiveness of the proposed algorithm.
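To make the two main ingredients above concrete, the sketch below shows one possible form of the reference-model reward and of a stochastic PID output-feedback policy. It is only an illustration under assumed choices (a discretized second-order reference model, a quadratic tracking penalty, and Gaussian exploration on the PID gains); the specific model, gains, and names are not taken from the paper.

```python
# Illustrative sketch (not the paper's implementation): a reference-model
# reward and a stochastic PID output-feedback policy for episodic RL.
import numpy as np

def reference_model_response(r, dt=0.02, wn=2.0, zeta=0.7):
    """Desired closed-loop output for a setpoint sequence r, generated by a
    discretized second-order model wn^2 / (s^2 + 2*zeta*wn*s + wn^2)."""
    r = np.asarray(r, dtype=float)
    y, yd = 0.0, 0.0
    out = np.empty_like(r)
    for k, rk in enumerate(r):
        ydd = wn ** 2 * (rk - y) - 2.0 * zeta * wn * yd
        yd += dt * ydd
        y += dt * yd
        out[k] = y
    return out

def episode_reward(y, y_ref):
    """Reference-model-based reward: penalize deviation of the measured
    output from the desired trajectory (one plausible 'informative' choice)."""
    return -np.sum((np.asarray(y) - np.asarray(y_ref)) ** 2)

class StochasticPIDPolicy:
    """PID output-feedback law with Gaussian exploration on its gains, so a
    manually tuned PID controller serves as the initial (mean) policy."""

    def __init__(self, kp, ki, kd, sigma=0.05, dt=0.02):
        self.theta = np.array([kp, ki, kd], dtype=float)  # mean gains
        self.sigma, self.dt = sigma, dt

    def sample_gains(self, rng):
        # Episode-wise parameter exploration: gains ~ N(theta, sigma^2 I)
        return self.theta + self.sigma * rng.standard_normal(3)

    def make_controller(self, gains):
        # Build a stateful control law u = f(e) from the sampled gains.
        kp, ki, kd = gains
        state = {"integ": 0.0, "prev_e": 0.0}

        def step(e):
            state["integ"] += e * self.dt
            deriv = (e - state["prev_e"]) / self.dt
            state["prev_e"] = e
            return kp * e + ki * state["integ"] + kd * deriv

        return step
```

In such a scheme, a manually tuned (kp, ki, kd) triple would serve as the initial mean policy, which is how the prior knowledge mentioned above would enter the learning process.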



Published: 01 March 2013
CLC: TP 18; TP 273.22
Cite this article:

HAO Chuan-chuan, FANG Zhou, LI Ping. Output feedback reinforcement learning control method based on reference model. J4, 2013, 47(3): 409-414.

URL:

http://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2013.03.003     OR     http://www.zjujournals.com/eng/Y2013/V47/I3/409


Output feedback reinforcement learning control based on reference model

Most existing direct policy search reinforcement learning control algorithms design state-feedback control policies for plants whose states are fully measurable, so the range of plants to which they apply is very limited. To address this, a more widely applicable model-reference output-feedback reinforcement learning control algorithm is proposed: its learning process relies only on the plant output, and it obtains an output-feedback control policy that gives the closed-loop system the desired dynamic performance. The algorithm constructs a reward function based on the reference model, which effectively describes the desired closed-loop dynamics; it adopts a parameterized stochastic control policy based on a PID output-feedback control law, so that prior knowledge and the empirical PID tuning rules commonly used in the control field can be used to determine a good initial policy and shorten the learning time; and it uses the eNAC algorithm, which has good learning performance, to optimize the control policy. Learning-control simulation results on a second-order open-loop unstable plant and on a linear parameter-varying (LPV) model of the pitch channel of a high-subsonic unmanned aerial vehicle verify the effectiveness of the algorithm.
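For the policy-optimization step, the following is a minimal eNAC-style update sketch, under the assumption that exploration is applied per episode to the controller parameters through a Gaussian distribution; `run_episode` is a hypothetical rollout function returning the reference-model-based return, and all hyperparameters are illustrative rather than taken from the paper.

```python
# Minimal episodic Natural Actor-Critic (eNAC) style update sketch, assuming
# Gaussian per-episode exploration of the controller parameters.
import numpy as np

def enac_update(theta, sigma, run_episode, n_episodes=20, lr=0.1, rng=None):
    """theta: mean controller parameters (e.g. PID gains) as an ndarray;
    run_episode(theta_i): hypothetical rollout returning the episode return."""
    rng = rng if rng is not None else np.random.default_rng(0)
    psi, returns = [], []
    for _ in range(n_episodes):
        theta_i = theta + sigma * rng.standard_normal(theta.shape)
        returns.append(run_episode(theta_i))
        # Sum of log-likelihood gradients of the Gaussian parameter policy
        psi.append((theta_i - theta) / sigma ** 2)
    # eNAC regression: fit R_i ~ psi_i . w + b; w estimates the natural gradient
    phi = np.hstack([np.array(psi), np.ones((n_episodes, 1))])
    coeffs, *_ = np.linalg.lstsq(phi, np.array(returns), rcond=None)
    w = coeffs[:-1]
    return theta + lr * w  # natural-gradient ascent on the expected return
```

Combined with the stochastic PID policy sketched earlier, one iteration would look like `theta = enac_update(np.array([kp, ki, kd]), 0.05, run_episode)`, repeated until the reference-model return stops improving.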

