Journal of Zhejiang University (Engineering Science)  2026, Vol. 60 Issue (7): 1528-1538    DOI: 10.3785/j.issn.1008-973X.2026.07.015
Mechanical Engineering
Deep reinforcement learning models and algorithms for single-machine scheduling considering deteriorated maintenance
Yong CHEN, Xizhi DU, Yiwei JIANG, Wenchao YI*, Zhi PEI, Zuzhen JI
College of Mechanical Engineering, Zhejiang University of Technology, Hangzhou 310023, China
Abstract:

A multi-stage machine state model was proposed for the single-machine scheduling problem under deterioration effects and maintenance strategies. With the objective of minimizing total production cost, a state transition mechanism was designed that combines deterioration evolution with maintenance effects. Job tardiness cost, machine operating cost, and maintenance cost were considered jointly so that the whole production process becomes more economical and efficient. An integrated decision-making framework for scheduling and maintenance was built on deep reinforcement learning, in which an agent was trained through interaction with the environment to learn a policy that jointly decides job sequencing and maintenance timing in a complex dynamic system. Instances of several scales were designed to validate the framework and model. The comparative results indicate that the proposed framework and algorithms achieve lower total scheduling and maintenance cost than several integrated optimization strategies, and that they balance the conflict between job scheduling and equipment maintenance in dynamic and uncertain environments.

Key words: single-machine scheduling; equipment maintenance; deep reinforcement learning; deterioration effect; integrated optimization
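Received: 2025-02-19    Published: 2026-05-23
CLC:  TP 181
Funding: Key Program of the National Natural Science Foundation of China (W2411062); Zhejiang Provincial Natural Science Foundation of China (LGG22G010002); National Natural Science Foundation of China (52005447, 71871203).
Corresponding author: Wenchao YI     E-mail: cy@zjut.edu.cn; yiwenchao@zjut.edu.cn
About the author: Yong CHEN (born 1973), male, professor, works on intelligent algorithms and optimization for complex systems. orcid.org/0000-0001-7778-2731. E-mail: cy@zjut.edu.cn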
Cite this article:

Yong CHEN, Xizhi DU, Yiwei JIANG, Wenchao YI, Zhi PEI, Zuzhen JI. Deep reinforcement learning models and algorithms for single-machine scheduling considering deteriorated maintenance. Journal of Zhejiang University (Engineering Science), 2026, 60(7): 1528-1538.

Link to this article:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2026.07.015
https://www.zjujournals.com/eng/CN/Y2026/V60/I7/1528

Fig. 1  Schematic of the deterioration effect process
Fig. 2  State transitions
Fig. 3  Scheduling decision model
| Symbol | Action | Description | Mathematical form |
| --- | --- | --- | --- |
| a1 | SPT | Shortest processing time first | $\min_{j\in B} p_j$ |
| a2 | LPT | Longest processing time first | $\max_{j\in B} p_j$ |
| a3 | EDD | Earliest due date first | $\min_{j\in B} d_j$ |
| a4 | FCFS | Earliest arrival time first | $\min_{j\in B} \mathrm{arrive}_j$ |
| a5 | MST | Minimum slack time | $\min_{j\in B} \left(d_j-\left(t+p_j\right)\right)$ |
| a6 | CR | Minimum critical ratio | $\min_{j\in B} \left(d_j-t\right)/p_j$ |
| a7 | MDD | Modified due date first | $\min_{j\in B} \max\left(d_j,\, t+p_j\right)$ |
| a8 | PM | Perform imperfect maintenance | $M_{i-1}+R_{\mathrm{PM}}$ |
| a9 | CM | Perform perfect maintenance | $M_{i-1}+R_{\mathrm{CM}}$ |

Table 1  Description of the action space
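The seven dispatching actions in Table 1 each pick the job in the buffer B that minimizes (or, for LPT, maximizes) a simple priority expression. The following is a minimal sketch of that selection step, not the authors' implementation. The `Job` class, the `RULES` dictionary, the `select_job` helper, and the tie-break by job index are all assumptions introduced for illustration; the maintenance actions a8 and a9 are left out because their state update depends on details not given here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Job:
    """Hypothetical job record: index, processing time p_j, due date d_j, arrival time."""
    idx: int
    p: float
    d: float
    arrive: float


# Each rule maps (job, current time t) to a key; the job with the smallest key wins.
# LPT maximizes p_j, so its key is negated.
RULES: Dict[str, Callable[[Job, float], float]] = {
    "SPT": lambda j, t: j.p,
    "LPT": lambda j, t: -j.p,
    "EDD": lambda j, t: j.d,
    "FCFS": lambda j, t: j.arrive,
    "MST": lambda j, t: j.d - (t + j.p),
    "CR": lambda j, t: (j.d - t) / j.p,
    "MDD": lambda j, t: max(j.d, t + j.p),
}


def select_job(rule: str, buffer: List[Job], t: float) -> Job:
    """Return the job chosen by `rule` at time t (ties broken by lowest index, an assumption)."""
    key = RULES[rule]
    return min(buffer, key=lambda j: (key(j, t), j.idx))


# Small illustrative buffer at time t = 5 (values are made up for the example).
buffer = [
    Job(idx=0, p=4, d=20, arrive=0),
    Job(idx=1, p=2, d=9, arrive=1),
    Job(idx=2, p=7, d=12, arrive=2),
]
chosen = {name: select_job(name, buffer, t=5).idx for name in RULES}
print(chosen)
```

Expressing every rule as a key function keeps the action space uniform, so an agent only has to output a rule name and the environment resolves it to a concrete job.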
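Fig. 4  Reward curves of machine health state versus scheduling strategy
Fig. 5  Reward versus running steps for a single strategy
Fig. 6  Network structure of the DRL algorithms
Fig. 7  Flow of the DRL algorithms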
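| Scale | SPT-M Mean | SPT-M Std | LPT-M Mean | LPT-M Std | FCFS-M Mean | FCFS-M Std | EDD-M Mean | EDD-M Std |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10 | 200.0 | 0.0 | 258.0 | 0.0 | 192.0 | 0.0 | 227.0 | 0.0 |
| 20 | 1005.6 | 10.2 | 1682.0 | 108.8 | 1058.7 | 27.1 | 1334.2 | 82.1 |
| 30 | 2160.6 | 58.1 | 3725.4 | 110.3 | 2678.0 | 113.1 | 3075.8 | 143.8 |
| 50 | 3610.0 | 157.5 | 8074.8 | 212.7 | 4679.4 | 291.3 | 5065.8 | 229.9 |
| 80 | 11198.2 | 450.4 | 22474.8 | 752.1 | 14871.1 | 597.9 | 15306.1 | 665.6 |
| 100 | 16709.3 | 553.8 | 33877.2 | 1188.9 | 21372.8 | 856.8 | 23350.4 | 886.7 |
| 150 | 32096.0 | 1198.4 | 71634.7 | 2146.8 | 42847.3 | 1791.3 | 47919.0 | 1474.0 |

| Scale | MST-M Mean | MST-M Std | CR-M Mean | CR-M Std | MDD-M Mean | MDD-M Std | Baseline Mean | Baseline Std |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10 | 197.0 | 0.0 | 226.0 | 0.0 | 191.0 | 0.0 | 191.0 | 0.0 |
| 20 | 1163.0 | 20.6 | 1053.5 | 27.7 | 973.8 | 22.5 | 973.8 | 22.5 |
| 30 | 2879.3 | 73.2 | 2269.0 | 109.5 | 2103.9 | 102.9 | 2103.9 | 102.9 |
| 50 | 4858.1 | 307.3 | 3597.3 | 191.6 | 3310.6 | 172.1 | 3310.6 | 172.1 |
| 80 | 15181.1 | 687.8 | 11678.6 | 519.0 | 10642.0 | 494.0 | 10642.0 | 494.0 |
| 100 | 21907.0 | 937.2 | 16753.6 | 800.7 | 15763.6 | 766.8 | 15763.6 | 766.8 |
| 150 | 43670.2 | 1579.2 | 32857.9 | 1266.6 | 30546.1 | 1166.5 | 30546.1 | 1166.5 |

Table 2  Mean and standard deviation of cost under the R-M integrated optimization strategies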
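| Scale | SPT-M Min | LPT-M Min | EDD-M Min | FCFS-M Min | MST-M Min | CR-M Min | MDD-M Min |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 10 | 200 | 258 | 192 | 227 | 197 | 226 | 191 |
| 20 | 956 | 1565 | 1008.04 | 1255.675 | 1086 | 1006 | 904 |
| 30 | 2017 | 3441 | 2463 | 2831 | 2632 | 2070 | 1939 |
| 50 | 3245 | 7273 | 4314.02 | 4565 | 4251 | 3221 | 2940 |
| 80 | 10218 | 20871 | 13407 | 13711 | 13779 | 10322 | 9494 |
| 100 | 15076 | 31061 | 19241 | 21190 | 19249 | 14806 | 14030 |

Table 3  Minimum cost under the R-M integrated optimization strategies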
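| Parameter | Description | Value |
| --- | --- | --- |
| $n$ | Number of jobs (problem scale) | 10, 20, 30, 50, 80, 100, 150 |
| $p_j$ | Processing time of job $j$ | Discrete U(1, 10) |
| $d_j$ | Due date of job $j$ | $p_j$ + Discrete U($n$, $3n$) |
| $\alpha$ | Cost coefficient of tardiness | 1 |
| $\beta$ | Cost coefficient of machine wear | 5 |
| $M_0$ | Initial machine state value | 100 |
| $t_{\mathrm{CM}}$, $t_{\mathrm{PM}}$ | Durations of CM and PM | 20, 10 |
| $C_{\mathrm{CM}}$, $C_{\mathrm{PM}}$ | Costs of CM and PM | 60, 40 |
| $C_{\mathrm{broke}}$ | Additional repair cost after a breakdown | 100 |
| $N_{\mathrm{Scale}}$ | Reward normalization scale | 100 |
| $h_1$, $h_2$ | Deterioration and failure effect points | 60, 0 |
| $\sigma$ | Deterioration effect factor | 0.05 |
| $M_1$, $M_2$ | Penalty bases for out-of-bound states | 2, 2 |

Table 4  Environment parameter settings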
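The job data in Table 4 follow two discrete uniform draws, so a test instance can be sampled in a few lines. The sketch below is illustrative only: the function name `generate_instance`, the seed handling, and the assumption that both uniform ranges include their endpoints are not taken from the paper.

```python
import random
from typing import List, Tuple


def generate_instance(n: int, seed: int = 0) -> List[Tuple[int, int]]:
    """Sample n jobs as (p_j, d_j) pairs following the distributions listed in Table 4.

    Assumes inclusive endpoints: p_j ~ Discrete U(1, 10), d_j = p_j + Discrete U(n, 3n).
    """
    rng = random.Random(seed)  # local generator so instances are reproducible
    jobs = []
    for _ in range(n):
        p = rng.randint(1, 10)          # processing time
        d = p + rng.randint(n, 3 * n)   # due date offset grows with the instance scale
        jobs.append((p, d))
    return jobs


jobs = generate_instance(n=20, seed=42)
print(jobs[:5])
```

Because the due-date offset scales with n, larger instances keep a comparable ratio of due-date slack to total workload.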
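| Hyperparameter | DQN | PPO | A2C |
| --- | --- | --- | --- |
| Maximum training steps | $1.6\times10^{7}$ | $1.6\times10^{7}$ | $1.6\times10^{7}$ |
| Batch size | 256 | 256 | - |
| Environment interaction time steps | - | 2048 | 256 |
| Hidden layer nodes | 2×256 | 2×256 | 2×256 |
| Replay buffer size | $1\times10^{7}$ | - | - |
| Network synchronization frequency | $1\times10^{4}$ | - | - |
| Learning rate | $1\times10^{-4}$ | $1\times10^{-4}$ | $1\times10^{-4}$ |
| Discount factor $\gamma$ | 0.99 | 0.99 | 0.99 |
| Initial exploration rate | 0.01 | - | 0.01 |
| $\lambda$ (GAE) | - | 0.95 | 0.95 |
| Entropy regularization coefficient | - | 0.05 | 0.05 |
| Initial exploration rate | 1 | - | - |
| Exploration decay rate | 0.99 | - | - |
| Minimum exploration rate | 0.001 | - | - |

Table 5  Hyperparameter settings of the DRL algorithms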
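The last three DQN rows of Table 5 (initial exploration rate 1, decay rate 0.99, minimum 0.001) describe a decaying epsilon-greedy schedule. The table does not say how the decay is applied, so the sketch below assumes a multiplicative decay per update, clipped at the minimum; the function names are hypothetical.

```python
def epsilon_at(k: int, eps_start: float = 1.0, decay: float = 0.99,
               eps_min: float = 0.001) -> float:
    """Exploration rate after k decay updates (assumed form: eps_start * decay**k, floored)."""
    return max(eps_min, eps_start * decay ** k)


def updates_to_floor(eps_start: float = 1.0, decay: float = 0.99,
                     eps_min: float = 0.001) -> int:
    """Count decay updates until the schedule first reaches its floor."""
    k = 0
    while epsilon_at(k, eps_start, decay, eps_min) > eps_min:
        k += 1
    return k


n_updates = updates_to_floor()
print(n_updates, epsilon_at(100))
```

Under this assumed form the floor is reached after a few hundred updates, so for most of a run of $1.6\times10^{7}$ steps the agent would act almost greedily.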
Fig. 8  Effect of scale on the average episode reward
Fig. 9  Effect of scale on the average episode length
Fig. 10  Computation speed curves (scale 80)
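| Scale | Algorithm | Time steps/$10^{6}$ | Mean FPS | Training time/h |
| --- | --- | --- | --- | --- |
| 10 | A2C | 1 | 1498.4 | 0.18 |
| 10 | DQN | 1 | 1660.8 | 0.27 |
| 10 | PPO | 1 | 1141.6 | 0.24 |
| 20 | A2C | 2 | 1473.7 | 0.37 |
| 20 | DQN | 2 | 1440.6 | 0.54 |
| 20 | PPO | 2 | 1109.6 | 0.50 |
| 30 | A2C | 2 | 1447.3 | 0.38 |
| 30 | DQN | 2 | 1393.9 | 0.54 |
| 30 | PPO | 2 | 1093.3 | 0.51 |
| 50 | A2C | 4 | 1400.9 | 0.79 |
| 50 | DQN | 4 | 1283.4 | 1.11 |
| 50 | PPO | 4 | 1065.4 | 1.05 |
| 80 | A2C | 8 | 1360.6 | 1.63 |
| 80 | DQN | 8 | 1154.3 | 2.27 |
| 80 | PPO | 8 | 1038.6 | 2.15 |
| 100 | A2C | 8 | 1350.4 | 1.66 |
| 100 | DQN | 8 | 1121.4 | 2.31 |
| 100 | PPO | 8 | 1018.6 | 2.19 |
| 150 | A2C | 16 | 1276.1 | 3.51 |
| 150 | DQN | 16 | 1051.9 | 4.68 |
| 150 | PPO | 16 | 1007.2 | 4.33 |

Table 6  Comparison of training performance of the DRL algorithms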
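| Scale | Baseline Mean | Baseline Std | A2C Mean | A2C Std | DQN Mean | DQN Std | PPO Mean | PPO Std |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10 | 191.0 | 0.0 | 191.4 | 1.0 | 217.8 | 61.7 | 192.5 | 14.8 |
| 20 | 973.8 | 22.5 | 908.0 | 9.5 | 925.6 | 47.6 | 901.5 | 6.1 |
| 30 | 2103.9 | 102.9 | 1920.8 | 20.1 | 1948.3 | 109.0 | 1880.5 | 6.6 |
| 50 | 3310.6 | 172.1 | 3009.6 | 41.1 | 6177.7 | 2280.4 | 2936.7 | 19.0 |
| 80 | 10642.0 | 494.0 | 9712.7 | 65.1 | 10015.8 | 480.4 | 9469.7 | 40.3 |
| 100 | 15763.6 | 766.8 | 14551.0 | 172.2 | 14737.7 | 708.4 | 14020.2 | 119.5 |
| 150 | 30546.1 | 1166.5 | 27272.6 | 270.0 | 28197.4 | 1209.3 | 27073.3 | 275.1 |

Table 7  Mean and standard deviation of cost for the DRL methods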
Fig. 11  Comparison of cost optimization results of different algorithms
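| Scale | R-M Min | A2C Min | DQN Min | PPO Min |
| --- | --- | --- | --- | --- |
| 10 | 191.0 | 191.0 | 191.0 | 191.0 |
| 20 | 904.0 | 895.0 | 901.0 | 900.0 |
| 30 | 1939.0 | 1878.0 | 1887.0 | 1878.0 |
| 50 | 2940.0 | 2948.0 | 3246.4 | 2911.6 |
| 80 | 9494.0 | 9550.3 | 9566.6 | 9397.0 |
| 100 | 14030.0 | 14210.0 | 13909.6 | 13818.0 |
| 150 | 27519.0 | 26743.8 | 26560.9 | 26600.9 |

Table 8  Best values obtained by the DRL methods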
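| Scale | A2C $\Delta\mathrm{mean}$/% | A2C $\Delta\min$/% | DQN $\Delta\mathrm{mean}$/% | DQN $\Delta\min$/% | PPO $\Delta\mathrm{mean}$/% | PPO $\Delta\min$/% |
| --- | --- | --- | --- | --- | --- | --- |
| 10 | -0.23 | 0.00 | -12.31 | 0.00 | -0.78 | 0.00 |
| 20 | 7.25 | 1.01 | 5.21 | 0.33 | 8.01 | 0.44 |
| 30 | 9.53 | 3.25 | 7.98 | 2.76 | 11.88 | 3.25 |
| 50 | 10.00 | -0.27 | -46.41 | -9.44 | 12.73 | 0.98 |
| 80 | 9.57 | -0.59 | 6.25 | -0.76 | 12.38 | 1.03 |
| 100 | 8.33 | -1.27 | 6.96 | 0.87 | 12.43 | 1.53 |
| 150 | 12.00 | 2.90 | 8.33 | 3.61 | 12.83 | 3.45 |

Table 9  Cost optimization effect of the DRL methods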
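The percentages in Table 9 can be related to the means in Table 7. The page does not state the formula; the sketch below assumes the gap is measured relative to the DRL cost, $\Delta = (C_{\mathrm{base}} - C_{\mathrm{DRL}})/C_{\mathrm{DRL}} \times 100$, which reproduces several tabulated entries from the rounded means (other entries can differ in the last digit because the inputs are rounded). The function name and the data layout are illustrative.

```python
def improvement_pct(baseline: float, drl: float) -> float:
    """Relative cost gap in percent, measured against the DRL cost (assumed definition)."""
    return (baseline - drl) / drl * 100.0


# (scale, algorithm) -> (baseline mean, DRL mean), copied from Table 7.
table7_means = {
    (20, "A2C"): (973.8, 908.0),
    (30, "A2C"): (2103.9, 1920.8),
    (50, "DQN"): (3310.6, 6177.7),
}

deltas = {key: round(improvement_pct(b, c), 2) for key, (b, c) in table7_means.items()}
print(deltas)
```

A positive value means the DRL method is cheaper than the baseline rule; the large negative DQN entry at scale 50 reflects its mean cost of 6177.7 against a baseline of 3310.6.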
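| Number of scales | Method | Min: median | Min: rank sum | Mean: median | Mean: rank sum |
| --- | --- | --- | --- | --- | --- |
| 7 | PPO | 2931.8 | 10 | 2957.3 | 10 |
| 7 | MDD-M | 2945.0 | 15 | 3363.9 | 24 |
| 7 | A2C | 3027.4 | 18 | 3058.8 | 14 |
| 7 | DQN | 3672.7 | 36 | 3819.0 | 31 |
| 7 | SPT-M | 3263.9 | 37 | 3633.0 | 36 |
| 7 | CR-M | 3258.3 | 41 | 3644.5 | 42 |
| 7 | EDD-M | 4258.6 | 45 | 4649.8 | 44 |
| 7 | MST-M | 4263.8 | 52 | 4837.5 | 52 |
| 7 | FCFS-M | 4534.5 | 61 | 5040.8 | 62 |

$p_1 = 0.00$ (Min), $p_2 = 0.00$ (Mean)

Table 10  Ranking results of the Friedman test for different methods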
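The rank sums in Table 10 are enough to recompute the Friedman statistic, $\chi^2_F = \frac{12}{Nk(k+1)}\sum_j R_j^2 - 3N(k+1)$, with $N = 7$ scales and $k = 9$ methods. The sketch below is a check under stated assumptions rather than the authors' procedure: it ignores any tie correction, and it uses the closed-form chi-square tail that holds only for an even number of degrees of freedom (here $k - 1 = 8$). Function names are hypothetical.

```python
import math
from typing import Sequence


def friedman_statistic(rank_sums: Sequence[float], n_blocks: int) -> float:
    """Friedman chi-square statistic from per-method rank sums (no tie correction)."""
    k = len(rank_sums)
    sum_sq = sum(r * r for r in rank_sums)
    return 12.0 / (n_blocks * k * (k + 1)) * sum_sq - 3.0 * n_blocks * (k + 1)


def chi2_sf_even_df(x: float, df: int) -> float:
    """Upper-tail chi-square probability; this closed form is exact only for even df."""
    if df % 2 != 0:
        raise ValueError("closed form requires an even number of degrees of freedom")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i      # builds (x/2)^i / i!
        total += term
    return math.exp(-half) * total


# Rank sums copied from Table 10 (order: PPO, MDD-M, A2C, DQN, SPT-M, CR-M, EDD-M, MST-M, FCFS-M).
min_rank_sums = [10, 15, 18, 36, 37, 41, 45, 52, 61]
mean_rank_sums = [10, 24, 14, 31, 36, 42, 44, 52, 62]

stat_min = friedman_statistic(min_rank_sums, n_blocks=7)
stat_mean = friedman_statistic(mean_rank_sums, n_blocks=7)
p_min = chi2_sf_even_df(stat_min, df=8)
p_mean = chi2_sf_even_df(stat_mean, df=8)
print(round(stat_min, 3), round(stat_mean, 3), p_min, p_mean)
```

Both p-values fall well below 0.005, which agrees with the reported $p_1 = p_2 = 0.00$ after rounding to two decimals.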
Fig. 12  Record of scheduling results (scale 50)