IET Cyber-Systems and Robotics  2019, Vol. 1 Issue (1): 28-37    DOI: 10.1049/iet-csr.2018.0001
    
Time-in-action RL
Jiangcheng Zhu1, Zhepei Wang1, Douglas Mcilwraith2, Chao Wu3, Chao Xu1, Yike Guo2
1 Institute of Cyber-Systems and Control, Department of Control Science and Engineering, Zhejiang University, Hangzhou, People's Republic of China  2 Data Science Institute, Department of Computing, Imperial College London, London, UK  3 School of Public Affairs, Zhejiang University, Hangzhou, People's Republic of China 
Abstract: The authors propose a novel reinforcement learning (RL) framework in which agent behaviour is governed by traditional control theory. This integrated approach, called time-in-action RL, makes RL applicable to many real-world systems whose underlying dynamics are known in their control-theoretic formalism. The key to this integration is modelling an explicit time function that maps a state-action pair to the time the underlying controller needs to accomplish the action. In this framework, an action is described by its value (action value) and the time it takes to perform (action time). The action value results from the RL policy for a given state; the action time is estimated by an explicit time model learnt from measured activities of the underlying controller. The RL value network is then trained with the embedded time model to predict action time. The approach is tested on a variant of Atari Pong and shown to converge.
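To make the abstract's pipeline concrete, the sketch below gives one plausible reading in PyTorch: a time model fitted by supervised regression on measured controller timings, and a bootstrap value target that embeds the predicted action time through gamma**tau discounting, as in the semi-MDP treatment of variable-duration actions. All names (TimeModel, ValueNet, td_target) and the gamma**tau form are illustrative assumptions, not the authors' implementation; the abstract only states that the value network is trained with the embedded time model.

```python
import torch
import torch.nn as nn

class TimeModel(nn.Module):
    """Maps a (state, action) pair to the predicted time the underlying
    controller needs to accomplish the action (the 'action time')."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Softplus(),  # completion time is positive
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

class ValueNet(nn.Module):
    """Plain state-value network for the RL side of the framework."""
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state):
        return self.net(state)

def td_target(reward, next_state, tau, value_net, gamma=0.99):
    """Bootstrap target embedding the time model: discounting by gamma**tau
    (an assumed semi-MDP-style choice) penalises actions that the
    controller takes longer to complete."""
    with torch.no_grad():
        return reward + gamma ** tau * value_net(next_state)

# Supervised fit of the time model to measured controller timings.
time_model = TimeModel(state_dim=4, action_dim=2)
optimiser = torch.optim.Adam(time_model.parameters(), lr=1e-3)
states, actions = torch.randn(32, 4), torch.randn(32, 2)
measured_tau = torch.rand(32, 1) * 5.0        # stand-in for logged timings
loss = nn.functional.mse_loss(time_model(states, actions), measured_tau)
optimiser.zero_grad(); loss.backward(); optimiser.step()

# Using the learnt time model inside the value target.
target = td_target(torch.zeros(32, 1), torch.randn(32, 4),
                   time_model(states, actions), ValueNet(state_dim=4))
```

The Softplus output keeps predicted times positive, and discounting by gamma**tau is the natural way to account for actions of variable duration, which is why it is assumed here.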
Received: 2018-12-14    Published: 2019-05-10
Cite this article:

Jiangcheng Zhu, Zhepei Wang, Douglas Mcilwraith, Chao Wu, Chao Xu, Yike Guo. Time-in-action RL. IET Cyber-Systems and Robotics, 2019, 1(1): 28-37.

Link to this article:

http://www.zjujournals.com/iet-csr/CN/10.1049/iet-csr.2018.0001        http://www.zjujournals.com/iet-csr/CN/Y2019/V1/I1/28
