基于简化概率选择框架的双足机器人模仿学习

doi:10.3785/j.issn.1008-973X.2026.05.016

浙江大学学报(工学版)

2026, Vol. 60

Issue (5): 1071-1081 DOI: 10.3785/j.issn.1008-973X.2026.05.016

计算机技术、控制工程

基于简化概率选择框架的双足机器人模仿学习

薛雯(

),赵硕,李永强*(

)

浙江工业大学信息工程学院，浙江杭州 310023

Imitation learning for bipedal robots based on simplified probabilistic framework for options

Wen XUE(

),Shuo ZHAO,Yongqiang LI*(

)

College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China

全文: PDF(1878 KB) HTML

摘要：

专家数据未显式满足马尔可夫性质会限制模仿学习方法的有效性,为此提出基于简化概率选择框架的分层模仿学习方法. 通过保留选项变量并去除终止变量, 构建紧凑的策略建模框架. 在优化过程中, 结合期望最大化算法进行隐变量建模, 引入拉格朗日乘子法进行约束条件处理（如策略归一性）. 在多个典型连续动作控制任务中开展仿真实验, 对比不同模仿学习方法的训练性能. 结果表明, 所提方法在非马尔可夫条件下训练过程更稳定、策略收敛性更佳. 将该模仿学习模型应用于双足机器人仿真, 实现了机器人稳定的前向行走, 验证了简化概率选择框架的可行性与有效性.

关键词： 双足机器人; 模仿学习; 隐变量建模; 期望最大化算法; 拉格朗日优化

Abstract:

To address the limitation of imitation learning methods caused by expert data that does not explicitly satisfy Markov properties, a hierarchical imitation learning method based on a simplified probabilistic framework for options was proposed. By retaining option variables and removing termination variables, a compact strategy modeling framework was constructed. In the optimization process, the expectation maximization algorithm was combined to model latent variables, and the Lagrange multiplier method was introduced to handle constraints such as policy normalization. Simulation experiments on multiple typical continuous action control tasks were conducted, and the training performance was compared with various imitation learning methods. Results indicate that the proposed method has a more stable training process and better policy convergence under non-Markov conditions. Furthermore, the imitation learning model was applied to the simulation of bipedal robots, achieving stable forward walking and verifying the feasibility and effectiveness of the simplified probabilistic framework for options.

Key words: bipedal robot imitation learning latent variable modeling expectation maximization algorithm Lagrangian optimization

收稿日期: 2025-08-16 出版日期: 2026-05-06

CLC:

TP18

基金资助: 国家自然科学基金资助项目（U2341216）.

通讯作者: 李永强 E-mail: xuewen_xw@163.com;yqli@zjut.edu.cn

作者简介: 薛雯（2001—），女，硕士生，从事双足机器人研究. orcid.org/0009-0004-7940-1612. E-mail：xuewen_xw@163.com

	服务
	把本文推荐给朋友
	加入引用管理器
	E-mail Alert
	作者相关文章
	薛雯
	赵硕
	李永强

引用本文:

薛雯,赵硕,李永强. 基于简化概率选择框架的双足机器人模仿学习[J]. 浙江大学学报(工学版), 2026, 60(5): 1071-1081.

Wen XUE,Shuo ZHAO,Yongqiang LI. Imitation learning for bipedal robots based on simplified probabilistic framework for options. Journal of ZheJiang University (Engineering Science), 2026, 60(5): 1071-1081.

链接本文:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2026.05.016 或 https://www.zjujournals.com/eng/CN/Y2026/V60/I5/1071

图 1 简化概率选择框架策略模型结构图

表 1 控制参数的取值范围

图 2 不同模仿学习方法在典型连续动作控制任务中的训练性能对比

表 2 在典型连续动作控制任务中不同模仿学习方法的平均回报与标准差

表 3 模仿学习训练过程中的超参数配置

图 3 不同模仿学习方法在双足机器人控制任务中的训练性能对比

图 4 双足机器人在所提模型控制下的前向稳定行走过程

1	LI Z, PENG X B, ABBEEL P, et al Reinforcement learning for versatile, dynamic, and robust bipedal locomotion control[J]. The International Journal of Robotics Research, 2025, 44 (5): 840- 888 doi: 10.1177/02783649241285161
2	RAIBERT M H, BROWN H B Jr, CHEPPONIS M Experiments in balance with a 3D one-legged hopping machine[J]. The International Journal of Robotics Research, 1984, 3 (2): 75- 92 doi: 10.1177/027836498400300207
3	CASTILLO G A, WENG B, ZHANG W, et al Reinforcement learning-based cascade motion policy design for robust 3D bipedal locomotion[J]. IEEE Access, 2022, 10: 20135- 20148 doi: 10.1109/ACCESS.2022.3151771
4	GULIYEV Z, PARSAYAN A. Reinforcement learning based robot control [C]// Proceedings of the IEEE 16th International Conference on Application of Information and Communication Technologies. Washington DC: IEEE, 2023: 1–6.
5	WANG S, BRAAKSMA J, BABUSKA R, et al. Reinforcement learning control for biped robot walking on uneven surfaces [C]// Proceedings of the 2006 IEEE International Joint Conference on Neural Network Proceedings. Vancouver: IEEE, 2006: 4173–4178.
6	SHAO Y, JIN Y, HUANG Z, et al A learning-based control pipeline for generic motor skills for quadruped robots[J]. Journal of Zhejiang University: Science A, 2024, 25 (6): 443- 454 doi: 10.1631/jzus.A2300128
7	JIN Y, LIU X, SHAO Y, et al High-speed quadrupedal locomotion by imitation-relaxation reinforcement learning[J]. Nature Machine Intelligence, 2022, 4 (12): 1198- 1208 doi: 10.1038/s42256-022-00576-3
8	KIM J W, ZHAO T Z, SCHMIDGALL S, et al. Surgical robot transformer (SRT): imitation learning for surgical tasks [EB/OL]. (2024−07−17)[2025−06−11]. https://arxiv.org/pdf/2407.12998.
9	VINYALS O, BABUSCHKIN I, CZARNECKI W M, et al Grandmaster level in StarCraft II using multi-agent reinforcement learning[J]. Nature, 2019, 575 (7782): 350- 354 doi: 10.1038/s41586-019-1724-z
10	丁加涛, 何杰, 李林芷, 等基于模型预测控制的仿人机器人实时步态优化[J]. 浙江大学学报: 工学版, 2019, 53 (10): 1843- 1851 DING Jiatao, HE Jie, LI Linzhi, et al Real-time walking pattern optimization for humanoid robot based on model predictive control[J]. Journal of Zhejiang University: Engineering Science, 2019, 53 (10): 1843- 1851
11	秦海鹏, 秦瑞, 施晓芬, 等基于模型预测的四足机器人运动控制[J]. 浙江大学学报: 工学版, 2024, 58 (8): 1565- 1576 QIN Haipeng, QIN Rui, SHI Xiaofen, et al Motion control of quadruped robot based on model prediction[J]. Journal of Zhejiang University: Engineering Science, 2024, 58 (8): 1565- 1576
12	AHN K, MHAMMEDI Z, MANIA H, et al. Model predictive control via on-policy imitation learning [EB/OL]. (2022−10−17)[2025−06−11]. https://arxiv.org/pdf/2210.09206.
13	TAGLIABUE A, HOW J P. Output feedback tube MPC-guided data augmentation for robust, efficient sensorimotor policy learning [C]// Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems. Kyoto: IEEE, 2022: 8644−8651.
14	DUAN H, DAO J, GREEN K, et al. Learning task space actions for bipedal locomotion [C]// Proceedings of the IEEE International Conference on Robotics and Automation. Xi’an: IEEE, 2021: 1276–1282.
15	SATO M A, NAKAMURA Y, ISHII S. Reinforcement learning for biped locomotion [C]// Artificial Neural Networks — ICANN 2002. Berlin: Springer, 2002: 777–782.
16	PETERS J, VIJAYAKUMAR S, SCHAAL S. Reinforcement learning for humanoid robotics [C]// Proceedings of the Third IEEE-RAS International Conference on Humanoid Robots. Karlsruhe: [s.n.], 2003: 1–20.
17	PENG X B, ABBEEL P, LEVINE S, et al DeepMimic: example-guided deep reinforcement learning of physics-based character skills[J]. ACM Transactions on Graphics, 2018, 37 (4): 1- 14
18	PENG X B, KANAZAWA A, MALIK J, et al SFV: reinforcement learning of physical skills from videos[J]. ACM Transactions on Graphics, 2018, 37 (6): 1- 14
19	GALLJAMOV R, ZHAO G, BELOUSOV B, et al. Improving sample efficiency of example-guided deep reinforcement learning for bipedal walking [C]// Proceedings of the IEEE-RAS 21st International Conference on Humanoid Robots. Ginowan: IEEE, 2023: 587–593.
20	JUDAH K, FERN A, TADEPALLI P, et al Imitation learning with demonstrations and shaping rewards[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2014, 28 (1): 1 doi: 10.1609/aaai.v28i1.9024
21	WEI T, LUO Q, MO Y, et al Design of the three-bus control system utilising periodic relay for a centipede-like robot[J]. Robotica, 2016, 34 (8): 1841- 1854 doi: 10.1017/S0263574714002628
22	WANG D, BELTRAME G. Deployable reinforcement learning with variable control rate [EB/OL]. (2024–04–02)[2025–06–11]. https://arxiv.org/pdf/2401.09286.
23	PAN Z, YIN S, WEN G, et al Reinforcement learning control for a three-link biped robot with energy-efficient periodic gaits[J]. Acta Mechanica Sinica, 2023, 39 (2): 522304 doi: 10.1007/s10409-022-22304-x
24	AGRAWAL R, DAHLIN N, JAIN R, et al Markov balance satisfaction improves performance in strictly batch offline imitation learning[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2025, 39 (15): 15311- 15319
25	DANIEL C, VAN HOOF H, PETERS J, et al Probabilistic inference for determining options in reinforcement learning[J]. Machine Learning, 2016, 104 (2): 337- 357
26	ZHANG Z, PASCHALIDIS I. Provable hierarchical imitation learning via EM [C]// Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS). San Diego: PMLR, 2021, 130: 883–891.
27	JING M, HUANG W, SUN F, et al. Adversarial option-aware hierarchical imitation learning [C]// Proceedings of the International Conference on Machine Learning (ICML). [S. l.]: PMLR, 2021: 5097–5106.
28	LI Z, XU T, QIN Z, et al. Imitation learning from imperfection: theoretical justifications and algorithms [C]// 37th Conference on Neural Information Processing Systems. 2023: 1−40.
29	SIKCHI H S, ZHENG Q, ZHANG A, et al. Dual RL: unification and new methods for reinforcement and imitation learning [C]// Proceedings of the International Conference on Learning Representations. Vienna: [s.n.], 2024: 1−48.
30	KUJANPÄÄ K, PAJARINEN J, ILIN A. Hierarchical imitation learning with vector quantized models [C]// Proceedings of the 40th International Conference on Machine Learning (ICML). [S.l.]: PMLR, 2023: 17896–17919.
31	GRANDIA R, JENELTEN F, YANG S, et al Perceptive locomotion through nonlinear model-predictive control[J]. IEEE Transactions on Robotics, 2023, 39 (5): 3402- 3421 doi: 10.1109/TRO.2023.3275384
32	DARIO BELLICOSO C, GEHRING C, HWANGBO J, et al. Perception-less terrain adaptation through whole body control and hierarchical optimization [C]// Proceedings of the IEEE-RAS 16th International Conference on Humanoid Robots. Cancun: IEEE, 2017: 558–564.
33	SLEIMAN J P, FARSHIDIAN F, MINNITI M V, et al A unified MPC framework for whole-body dynamic locomotion and manipulation[J]. IEEE Robotics and Automation Letters, 2021, 6 (3): 4688- 4695 doi: 10.1109/LRA.2021.3068908
34	YI X, CARAMANIS C. Regularized EM algorithms: a unified framework and statistical guarantees [C]// Proceedings of the 28th Advances in Neural Information Processing Systems (NeurIPS). [S.l.]: Curran Associates, Inc. , 2015: 1567−1575.
35	SUTTON R S, PRECUP D, SINGH S Between MDPs and semi-MDPs: a framework for temporal abstraction in reinforcement learning[J]. Artificial Intelligence, 1999, 112 (1/2): 181- 211 doi: 10.1016/s0004-3702(99)00052-1
36	BACON P L, HARB J, PRECUP D. The option-critic architecture [J]. Proceedings of the AAAI Conference on Artificial Intelligence. [S.l.]: AAAI, 2017: 1726–1734.
37	KHREICH W, GRANGER E, MIRI A, et al On the memory complexity of the forward–backward algorithm[J]. Pattern Recognition Letters, 2010, 31 (2): 91- 99 doi: 10.1016/j.patrec.2009.09.023
38	PENG X B, MA Z, ABBEEL P, et al. AMP: adversarial motion priors for stylized physics-based character control [EB/OL]. (2022−05−12)[2025−06−11]. https://arxiv.org/pdf/2104.02180.
39	TAN J, ZHANG T, COUMANS E, et al. Sim-to-real: learning agile locomotion for quadruped robots [EB/OL]. (2018–05–16)[2025–06–11]. https://arxiv.org/pdf/1804.10332.
40	PENG X B, BERSETH G, VAN DE PANNE M Terrain-adaptive locomotion skills using deep reinforcement learning[J]. ACM Transactions on Graphics, 2016, 35 (4): 1- 12 doi: 10.1145/2897824.2925881
41	KIM D, DI CARLO J, KATZ B, et al. Highly dynamic quadruped locomotion via whole-body impulse control and model predictive control [EB/OL]. (2019−09−14)[2025−06−11]. https://arxiv.org/pdf/1909.06586.
42	IOFFE S, SZEGEDY C. Batch normalization: accelerating deep network training by reducing internal covariate shift [C]// Proceedings of the 32nd International Conference on Machine Learning. [S.l.]: PMLR, 2015: 448−456.

[1]	郭策,曾志文,朱鹏铭,周智千,卢惠民. 基于图卷积模仿学习的分布式群集控制[J]. 浙江大学学报(工学版), 2022, 56(6): 1055-1061.
[2]	汤自林,高霄,肖晓晖. 基于模仿学习的变刚度人机协作搬运控制[J]. 浙江大学学报(工学版), 2021, 55(11): 2091-2099.
[3]	袁海辉,葛一敏,甘春标. 不确定性扰动下双足机器人动态步行的自适应鲁棒控制[J]. 浙江大学学报(工学版), 2019, 53(11): 2049-2057.
[4]	陈迪剑, 徐一展, 王斌锐. 基于双生成函数的步行机器人最优步态生成[J]. 浙江大学学报(工学版), 2018, 52(7): 1253-1259.
[5]	徐程, 曲昭伟, 王殿海, 金盛. 混合自行车交通流速度分布模型[J]. 浙江大学学报(工学版), 2017, 51(7): 1331-1338.
[6]	熊圆圆, 潘刚, 于玲. 一种欠驱动平面双足机器人变速控制策略[J]. J4, 2013, 47(6): 1006-1012.

Viewed

Full text

Abstract

Cited

Shared

Discussed