Output feedback reinforcement learning control method based on reference model

HAO Chuan-chuan1, FANG Zhou2, LI Ping2
1. Department of Control Science and Engineering, Zhejiang University, Hangzhou 310027, China;
2. School of Aeronautics and Astronautics, Zhejiang University, Hangzhou 310027, China
|
|
Abstract Most existing direct policy search reinforcement learning (RL) control methods require full measurement of the system states and yield a state-feedback control policy, so they are suitable only for a limited class of systems. This work develops an RL control method that requires only the system output and converges to an output-feedback control policy under which the closed-loop system exhibits the desired dynamic properties. An informative reward function based on a reference model is adopted to effectively represent the desired closed-loop dynamics. A stochastic output-feedback control policy based on the PID law is used, so that prior knowledge can be exploited to determine a good initial policy through the systematic manual tuning techniques commonly used in the control community. The episodic Natural Actor-Critic (eNAC) algorithm, which has good learning performance, is used for policy optimization. Simulations on a second-order open-loop-unstable system and a linear parameter-varying (LPV) model of an unmanned aerial vehicle's (UAV's) longitudinal dynamics demonstrate the effectiveness of the proposed algorithm.
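A minimal sketch of the two ingredients described above, assuming a first-order discrete reference model and Gaussian exploration over the PID gains; all function names, gain values, and noise levels here are illustrative assumptions, not taken from the paper:

```python
import numpy as np

# Illustrative sketch (not the paper's implementation): a reward built from
# the tracking error against a reference model, and a stochastic PID
# output-feedback policy whose gains are drawn from a Gaussian centred on
# the learnable parameter vector theta = [Kp, Ki, Kd].

def reference_model_step(y_ref, r, a=0.9, b=0.1):
    """One step of an assumed first-order discrete reference model."""
    return a * y_ref + b * r

def reward(y, y_ref):
    """Informative reward: penalise deviation of the plant output y
    from the reference-model output y_ref."""
    return -(y - y_ref) ** 2

class StochasticPIDPolicy:
    """PID law u = Kp*e + Ki*integral(e) + Kd*de/dt with gains sampled
    from N(theta, sigma^2 I) once per episode for exploration."""

    def __init__(self, theta, sigma=0.05, dt=0.01, seed=0):
        self.theta = np.asarray(theta, dtype=float)  # mean gains [Kp, Ki, Kd]
        self.sigma = sigma
        self.dt = dt
        self.rng = np.random.default_rng(seed)
        self.e_int = 0.0
        self.e_prev = 0.0

    def sample_gains(self):
        """Draw one perturbed gain vector for a learning episode."""
        return self.theta + self.sigma * self.rng.standard_normal(self.theta.size)

    def act(self, e, gains):
        """Compute the control input from the tracking error e only
        (output feedback: no state measurement is required)."""
        self.e_int += e * self.dt
        de = (e - self.e_prev) / self.dt
        self.e_prev = e
        kp, ki, kd = gains
        return kp * e + ki * self.e_int + kd * de
```

Initializing `theta` from manually tuned PID gains is how the prior knowledge mentioned above would enter the learning process; the Gaussian perturbation keeps the policy stochastic, which policy-gradient methods such as eNAC require.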
|
Published: 01 March 2013
|
|
Output-feedback reinforcement learning control based on a reference model

Most existing direct policy search reinforcement learning control algorithms design state-feedback control policies for plants whose states are fully measurable, so the range of systems they apply to is quite limited. This paper therefore proposes a more widely applicable reference-model output-feedback reinforcement learning control algorithm, whose learning process depends only on the plant output and which obtains an output-feedback control policy that gives the closed-loop system the desired dynamic performance. The algorithm constructs a reward function based on the reference model, which effectively describes the desired closed-loop dynamics; it adopts a parameterized stochastic control policy based on a PID output-feedback law, so that prior knowledge can be exploited and a good initial policy determined by the empirical PID tuning methods common in the control field, shortening the learning time; and it uses the eNAC algorithm, which has good learning performance, for policy optimization. Learning control simulations on a second-order open-loop-unstable plant and a linear parameter-varying (LPV) model of the pitch channel of a high-subsonic UAV verify the effectiveness of the algorithm.
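For a Gaussian policy over the PID gains, g ~ N(theta, sigma^2 I), the per-episode score is (g - theta)/sigma^2, and eNAC obtains the natural gradient, together with a return baseline, as the least-squares solution of a regression of episode returns on these scores. The following is a sketch of that update under these assumptions; the function name and regression setup are our own illustration, not the paper's code:

```python
import numpy as np

def enac_natural_gradient(gain_samples, returns, theta, sigma):
    """Estimate the natural policy gradient from a batch of episodes
    (sketch of the episodic Natural Actor-Critic idea).

    gain_samples : (n, d) PID gain vectors sampled as g ~ N(theta, sigma^2 I)
    returns      : (n,)   accumulated reward of each episode

    Solves the least-squares system [Psi 1][w; J] = R, where Psi holds the
    per-episode scores grad_theta log pi(g | theta) = (g - theta)/sigma^2.
    In this formulation the Fisher matrix cancels, so w is directly the
    natural gradient direction and J estimates the baseline return.
    """
    G = np.asarray(gain_samples, dtype=float)
    R = np.asarray(returns, dtype=float)
    Psi = (G - theta) / sigma**2                  # score functions
    X = np.hstack([Psi, np.ones((len(R), 1))])    # append baseline column
    sol, *_ = np.linalg.lstsq(X, R, rcond=None)
    w, J = sol[:-1], sol[-1]
    return w, J
```

The policy update is then theta <- theta + alpha * w for a step size alpha; because w is a natural gradient, the step direction is invariant to how the gains are parameterized.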
|
|
|