Journal of Zhejiang University (Engineering Science)  2026, Vol. 60 Issue (5): 1071-1081    DOI: 10.3785/j.issn.1008-973X.2026.05.016
Computer Technology, Control Engineering
Imitation learning for bipedal robots based on simplified probabilistic framework for options
Wen XUE, Shuo ZHAO, Yongqiang LI*
College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
Abstract:

Expert data that does not explicitly satisfy the Markov property limits the effectiveness of imitation learning methods. To address this limitation, a hierarchical imitation learning method based on a simplified probabilistic framework for options was proposed. By retaining the option variables and removing the termination variables, a compact policy modeling framework was constructed. In the optimization process, the expectation-maximization (EM) algorithm was used to model the latent variables, and the Lagrange multiplier method was introduced to handle constraints such as policy normalization. Simulation experiments were conducted on multiple typical continuous-action control tasks, and the training performance of different imitation learning methods was compared. The results indicate that the proposed method trains more stably and converges to better policies under non-Markov conditions. The imitation learning model was then applied to a bipedal robot in simulation, where it achieved stable forward walking, verifying the feasibility and effectiveness of the simplified probabilistic framework for options.

Key words: bipedal robot; imitation learning; latent variable modeling; expectation maximization algorithm; Lagrangian optimization
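As a rough illustration of how an EM iteration and a Lagrange multiplier for a normalization constraint fit together, the sketch below runs EM on a toy mixture in which each latent option is a 1-D Gaussian over actions. The toy data, the function names (`e_step`, `m_step`, `run_em`), the state-independent option prior, and the fixed standard deviation are all assumptions made for illustration; this is not the model or algorithm of the paper.

```python
import math

def gaussian_pdf(a, mean, std):
    z = (a - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def e_step(actions, weights, means, std):
    """Posterior responsibility of each (hypothetical) option for each action."""
    resp = []
    for a in actions:
        joint = [w * gaussian_pdf(a, m, std) for w, m in zip(weights, means)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    return resp

def m_step(actions, resp):
    """Closed-form M-step.

    Maximizing sum_k N_k * log(w_k) subject to sum_k w_k = 1 with a
    Lagrange multiplier lam gives w_k = N_k / lam and lam = sum_k N_k.
    """
    n_options = len(resp[0])
    counts = [sum(r[k] for r in resp) for k in range(n_options)]
    lam = sum(counts)  # value of the multiplier at the stationary point
    weights = [c / lam for c in counts]
    means = [
        sum(r[k] * a for r, a in zip(resp, actions)) / counts[k]
        for k in range(n_options)
    ]
    return weights, means

def run_em(actions, weights, means, std=0.5, n_iters=50):
    for _ in range(n_iters):
        resp = e_step(actions, weights, means, std)
        weights, means = m_step(actions, resp)
    return weights, means

# Toy 1-D "expert actions" drawn around two modes (assumed data).
actions = [-1.1, -0.9, -1.0, 1.0, 1.2, 0.8, 0.9, 1.1]
weights, means = run_em(actions, [0.5, 0.5], [-0.5, 0.5])
```

The division by `lam` is where the normalization constraint enters: the updated option weights sum to one by construction, without any projection step.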
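Received: 2025-08-16    Published: 2026-05-06
CLC: TP18
Funding: National Natural Science Foundation of China (U2341216).
Corresponding author: Yongqiang LI    E-mail: xuewen_xw@163.com; yqli@zjut.edu.cn
About the author: Wen XUE (born 2001), female, master's student, engaged in research on bipedal robots. orcid.org/0009-0004-7940-1612. E-mail: xuewen_xw@163.com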

Cite this article:

Wen XUE, Shuo ZHAO, Yongqiang LI. Imitation learning for bipedal robots based on simplified probabilistic framework for options. Journal of Zhejiang University (Engineering Science), 2026, 60(5): 1071-1081.

Link to this article:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2026.05.016
https://www.zjujournals.com/eng/CN/Y2026/V60/I5/1071

Fig. 1  Structure of the policy model in the simplified probabilistic framework for options
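| Parameter | Value range |
| --- | --- |
| Joint torque limit $ {\tau }_{t} $ | [−60, 60] |
| Foot normal contact force $ \hat{f}_{z}^{{\mathrm{foot}}} $ | [0, 400] |
| Foot tangential contact force $ \hat{f}_{x,y}^{{\mathrm{foot}}} $ | [−100, 100] |
| Foot position limit, $ x $ direction $ \hat{p}_{x}^{{\mathrm{foot}}} $ | [−0.3, 0.3] |
| Foot position limit, $ y $ direction $ \hat{p}_{y}^{{\mathrm{foot}}} $ | [−0.15, 0.15] |
| Foot position limit, $ z $ direction $ \hat{p}_{z}^{{\mathrm{foot}}} $ | [−0.67, −0.59] |

Table 1  Value ranges of the control parameters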
Fig. 2  Comparison of the training performance of different imitation learning methods on typical continuous-action control tasks
Entries are $ \overline{R}\pm \sigma $.

| Method | Humanoid robot walking | Robotic arm pushing | 2D walking robot | Swimming robot | Bipedal walking robot | Humanoid robot standing up | Quadruped ant robot | One-legged hopping robot | Bipedal cheetah model |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GAIL-Option | −1701.46 ± 1109.85 | −636.24 ± 239.12 | 509.71 ± 132.52 | 12.38 ± 41.99 | 257.79 ± 10.60 | 360.16 ± 198.64 | 1958.55 ± 226.74 | 489.35 ± 74.15 | 1924.50 ± 217.21 |
| SPFFO | −158.63 ± 48.26 | −576.64 ± 33.26 | 1217.23 ± 756.50 | −3.97 ± 7.60 | −1.74 ± 7.84 | 275.29 ± 40.56 | 1027.70 ± 256.64 | 553.32 ± 279.83 | 2390.93 ± 358.77 |
| PFFO | 498.22 ± 28.01 | −391.40 ± 14.19 | 333.87 ± 132.80 | 41.52 ± 6.75 | 203.25 ± 86.42 | 885.02 ± 25.37 | 2251.33 ± 214.70 | 329.40 ± 194.13 | 1898.57 ± 574.98 |
| This work | 517.44 ± 46.11 | −384.08 ± 24.51 | 2296.74 ± 250.19 | 45.94 ± 1.95 | 264.40 ± 16.82 | 915.27 ± 48.67 | 2407.19 ± 106.10 | 704.11 ± 61.90 | 2871.36 ± 155.93 |
| DVL | 153.56 ± 63.20 | −44.11 ± 0.12 | 192.49 ± 175.22 | 31.01 ± 3.05 | −95.51 ± 18.47 | 812.97 ± 298.35 | 2010.40 ± 684.32 | 76.29 ± 43.06 | 340.72 ± 326.62 |
| ISWBC | −536.76 ± 594.76 | −387.42 ± 13.04 | 1593.60 ± 446.14 | 42.98 ± 5.65 | 256.99 ± 31.92 | −173.89 ± 146.13 | 2412.23 ± 72.90 | 488.13 ± 79.01 | 2738.28 ± 103.02 |
| HIPS | 531.89 ± 34.90 | −384.13 ± 17.24 | 1292.01 ± 189.64 | 25.08 ± 16.21 | 244.24 ± 29.73 | 865.51 ± 18.63 | 2338.80 ± 88.94 | 431.03 ± 69.39 | 1699.79 ± 187.79 |

Table 2  Mean return and standard deviation of different imitation learning methods on typical continuous-action control tasks
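| Parameter | Value |
| --- | --- |
| Optimizer | Adam |
| $ {\lambda }_{{\mathrm{phys}}} $ | 0.01 |
| Batch size | 64 |
| Gradient clipping threshold | 1.0 |
| Training epochs (maximum) | 3000 |
| Initial learning rate $ \alpha $ | $ 3\times 10^{-4} $ |
| Torque PD control $ {K}_{{\mathrm{p}}},{K}_{{\mathrm{d}}} $ | 160, 18 |
| Early stopping | 28, 60, 60, 60, 28 |
| $ {w}_{{\mathrm{p}}},{w}_{{\mathrm{e}}},{w}_{{\mathrm{m}}},{w}_{{\mathrm{f}}},{w}_{{\mathrm{trk}}} $ | 1.0, 0.1, 0.5, 0.3, 0.8 |

Table 3  Hyperparameter configuration for imitation learning training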
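To make the torque-level settings concrete, the following sketch combines the PD gains $ {K}_{{\mathrm{p}}}=160 $, $ {K}_{{\mathrm{d}}}=18 $ from Table 3 with the joint torque limit [−60, 60] from Table 1. The control law itself is not given on this page, so the textbook PD form used here, the function name `pd_torque`, and its argument names are assumptions for illustration only.

```python
KP, KD = 160.0, 18.0             # PD gains listed in Table 3
TAU_MIN, TAU_MAX = -60.0, 60.0   # joint torque limits listed in Table 1

def pd_torque(q_des, q, dq_des, dq):
    """Assumed textbook PD law, then clipping to the torque limits.

    q_des, dq_des: desired joint position and velocity (hypothetical inputs)
    q, dq:         measured joint position and velocity
    """
    tau = KP * (q_des - q) + KD * (dq_des - dq)
    return max(TAU_MIN, min(TAU_MAX, tau))

small_error = pd_torque(0.1, 0.0, 0.0, 0.0)   # unsaturated command
large_error = pd_torque(1.0, 0.0, 0.0, 0.0)   # saturates at the upper limit
damping_only = pd_torque(0.0, 0.0, 0.0, 1.0)  # pure velocity damping
```

With these gains, a position error of only 0.375 already reaches the 60 limit when the velocity error is zero, so the clipping step is active for any larger tracking error.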
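Fig. 3  Comparison of the training performance of different imitation learning methods on the bipedal robot control task
Fig. 4  Stable forward walking of the bipedal robot under the control of the proposed model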