Journal of ZheJiang University (Engineering Science)  2026, Vol. 60 Issue (7): 1528-1538    DOI: 10.3785/j.issn.1008-973X.2026.07.015
    
Deep reinforcement learning models and algorithms for single-machine scheduling considering deteriorated maintenance
Yong CHEN, Xizhi DU, Yiwei JIANG, Wenchao YI*, Zhi PEI, Zuzhen JI
College of Mechanical Engineering, Zhejiang University of Technology, Hangzhou 310023, China

Abstract  

A multi-stage machine state model was proposed for the single-machine scheduling problem under machine deterioration and maintenance strategies, with the objective of minimizing total production cost. A state transition mechanism was designed to capture both the evolution of deterioration and the effects of maintenance. Job tardiness cost, machine operating cost, and maintenance cost were considered jointly, so that the entire production process becomes more economical and efficient. An integrated scheduling and maintenance decision-making framework based on deep reinforcement learning was developed, in which the agent learns optimized policies by interacting with the environment. This enables joint decisions on job sequencing and maintenance timing in complex dynamic systems. Benchmark instances of various scales were designed, and computational experiments validated the effectiveness of the proposed model and framework. The results indicate that the proposed approach achieves lower total scheduling and maintenance costs than several integrated optimization strategies. It effectively balances the conflict between production scheduling and machine maintenance and learns a more advantageous integrated scheduling and maintenance policy in dynamic and uncertain environments.
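The abstract names three cost components, but this page does not give their functional form. The following minimal sketch assumes a simple linear sum. The default coefficients are the values from the environment parameter table (α = 1, β = 5, C_PM = 40, C_CM = 60, C_broke = 100). The function name `total_cost` and the `wear_units` input are hypothetical stand-ins for the paper's machine-state model.

```python
def total_cost(completion_times, due_dates, wear_units, n_pm, n_cm, n_broke,
               alpha=1.0, beta=5.0, c_pm=40.0, c_cm=60.0, c_broke=100.0):
    """Hypothetical linear objective: tardiness + operating (wear) + maintenance cost.

    Coefficient defaults come from the environment parameter table; how the paper
    actually aggregates the terms is an assumption here.
    """
    if len(completion_times) != len(due_dates):
        raise ValueError("completion_times and due_dates must have equal length")
    # Tardiness of job j: max(0, C_j - d_j), weighted by alpha
    tardiness = sum(max(0.0, c - d) for c, d in zip(completion_times, due_dates))
    tardiness_cost = alpha * tardiness
    # Operating cost: assumed linear charge per unit of machine wear, weighted by beta
    operating_cost = beta * wear_units
    # Maintenance cost: PM and CM actions plus the extra repair cost after breakdowns
    maintenance_cost = n_pm * c_pm + n_cm * c_cm + n_broke * c_broke
    return tardiness_cost + operating_cost + maintenance_cost


if __name__ == "__main__":
    # Two jobs, the second finishing 2 time units late; 3 wear units; one PM action.
    print(total_cost([5, 12], [6, 10], wear_units=3, n_pm=1, n_cm=0, n_broke=0))  # 57.0
```

A linear form keeps the three terms separable. This is convenient if the per-step reward is later defined as the negative cost increment, but the paper's reward shaping (for example the normalization scale N_Scale) is not reproduced here.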



Key words: single-machine scheduling, equipment maintenance, deep reinforcement learning, deterioration effect, integrated optimization
Received: 19 February 2025      Published: 23 May 2026
CLC:  TP 181  
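Fund:  Key Program of the National Natural Science Foundation of China (W2411062); Natural Science Foundation of Zhejiang Province (LGG22G010002); National Natural Science Foundation of China (52005447, 71871203).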
Corresponding Authors: Wenchao YI     E-mail: cy@zjut.edu.cn;yiwenchao@zjut.edu.cn
Cite this article:

Yong CHEN,Xizhi DU,Yiwei JIANG,Wenchao YI,Zhi PEI,Zuzhen JI. Deep reinforcement learning models and algorithms for single-machine scheduling considering deteriorated maintenance. Journal of ZheJiang University (Engineering Science), 2026, 60(7): 1528-1538.

URL:

https://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2026.07.015     OR     https://www.zjujournals.com/eng/Y2026/V60/I7/1528


Fig.1 Deterioration effect process
Fig.2 Transition of state
Fig.3 Scheduling decision model
| Symbol | Action | Description | Mathematical form |
| --- | --- | --- | --- |
| a1 | SPT | Shortest processing time first | $\min_{j\in B} p_j$ |
| a2 | LPT | Longest processing time first | $\max_{j\in B} p_j$ |
| a3 | EDD | Earliest due date first | $\min_{j\in B} d_j$ |
| a4 | FCFS | Earliest arrival time first | $\min_{j\in B} \mathrm{arrive}_j$ |
| a5 | MST | Minimum slack time | $\min_{j\in B} \left(d_j-(t+p_j)\right)$ |
| a6 | CR | Minimum critical ratio | $\min_{j\in B} (d_j-t)/p_j$ |
| a7 | MDD | Modified due date first | $\min_{j\in B} \max(d_j,\ t+p_j)$ |
| a8 | PM | Perform imperfect maintenance | $M_{i-1}+R_{\mathrm{PM}}$ |
| a9 | CM | Perform complete maintenance | $M_{i-1}+R_{\mathrm{CM}}$ |
Tab.1 Description of action space
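Actions a1 to a7 are priority dispatching rules. Each one picks a job from the waiting buffer B at the current time t. Actions a8 and a9 update the machine state instead. The following minimal Python sketch of the table assumes each job carries p_j, d_j and an arrival time. `Job`, `RULES`, `select_job` and `maintain` are hypothetical names. The recovery amounts R_PM and R_CM are passed in as parameters because their values are not given on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    job_id: int
    p: float       # processing time p_j (>= 1 in the experiments, so CR never divides by 0)
    d: float       # due date d_j
    arrive: float  # arrival time arrive_j


# Priority key for dispatching actions a1-a7: the selected job minimises its key
# at the current time t (LPT negates p_j so that minimising picks the longest job).
RULES = {
    "SPT": lambda job, t: job.p,
    "LPT": lambda job, t: -job.p,
    "EDD": lambda job, t: job.d,
    "FCFS": lambda job, t: job.arrive,
    "MST": lambda job, t: job.d - (t + job.p),
    "CR": lambda job, t: (job.d - t) / job.p,
    "MDD": lambda job, t: max(job.d, t + job.p),
}


def select_job(rule, buffer, t):
    """Return the job that dispatching `rule` picks from the waiting buffer B at time t."""
    if not buffer:
        raise ValueError("buffer B is empty")
    priority = RULES[rule]
    # min() returns the first minimiser, so buffer order breaks ties (an assumption).
    return min(buffer, key=lambda job: priority(job, t))


def maintain(m_prev, recovery):
    """Actions a8 (PM) and a9 (CM): M_i = M_{i-1} + R, where R is R_PM or R_CM."""
    return m_prev + recovery


if __name__ == "__main__":
    buffer = [Job(0, p=5, d=20, arrive=0), Job(1, p=2, d=15, arrive=1), Job(2, p=8, d=12, arrive=2)]
    for name in RULES:
        print(name, select_job(name, buffer, t=10).job_id)
    # SPT 1, LPT 2, EDD 2, FCFS 0, MST 2, CR 2, MDD 1
```

Each rule is encoded as a key to minimise, so action selection stays uniform. This is convenient when a DRL agent outputs a discrete action index that is mapped onto one of these rules. In the example, EDD and MDD disagree: job 2 has the earliest due date, but it can no longer finish by it, so its modified due date becomes t + p_j = 18.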
Fig.4 Machine status and scheduling strategy reward curve
Fig.5 Single strategy running step and reward curve
Fig.6 Network structure of DRL algorithm
Fig.7 Algorithm flow of DRL
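| Scale | SPT-M Mean | SPT-M Std | LPT-M Mean | LPT-M Std | FCFS-M Mean | FCFS-M Std | EDD-M Mean | EDD-M Std |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10 | 200.0 | 0.0 | 258.0 | 0.0 | 192.0 | 0.0 | 227.0 | 0.0 |
| 20 | 1005.6 | 10.2 | 1682.0 | 108.8 | 1058.7 | 27.1 | 1334.2 | 82.1 |
| 30 | 2160.6 | 58.1 | 3725.4 | 110.3 | 2678.0 | 113.1 | 3075.8 | 143.8 |
| 50 | 3610.0 | 157.5 | 8074.8 | 212.7 | 4679.4 | 291.3 | 5065.8 | 229.9 |
| 80 | 11198.2 | 450.4 | 22474.8 | 752.1 | 14871.1 | 597.9 | 15306.1 | 665.6 |
| 100 | 16709.3 | 553.8 | 33877.2 | 1188.9 | 21372.8 | 856.8 | 23350.4 | 886.7 |
| 150 | 32096.0 | 1198.4 | 71634.7 | 2146.8 | 42847.3 | 1791.3 | 47919.0 | 1474.0 |

| Scale | MST-M Mean | MST-M Std | CR-M Mean | CR-M Std | MDD-M Mean | MDD-M Std | Baseline Mean | Baseline Std |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10 | 197.0 | 0.0 | 226.0 | 0.0 | 191.0 | 0.0 | 191.0 | 0.0 |
| 20 | 1163.0 | 20.6 | 1053.5 | 27.7 | 973.8 | 22.5 | 973.8 | 22.5 |
| 30 | 2879.3 | 73.2 | 2269.0 | 109.5 | 2103.9 | 102.9 | 2103.9 | 102.9 |
| 50 | 4858.1 | 307.3 | 3597.3 | 191.6 | 3310.6 | 172.1 | 3310.6 | 172.1 |
| 80 | 15181.1 | 687.8 | 11678.6 | 519.0 | 10642.0 | 494.0 | 10642.0 | 494.0 |
| 100 | 21907.0 | 937.2 | 16753.6 | 800.7 | 15763.6 | 766.8 | 15763.6 | 766.8 |
| 150 | 43670.2 | 1579.2 | 32857.9 | 1266.6 | 30546.1 | 1166.5 | 30546.1 | 1166.5 |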
Tab.2 Mean and standard deviation of cost with R-M integrated optimization strategies
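| Scale | SPT-M Min | LPT-M Min | EDD-M Min | FCFS-M Min | MST-M Min | CR-M Min | MDD-M Min |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 10 | 200 | 258 | 192 | 227 | 197 | 226 | 191 |
| 20 | 956 | 1565 | 1008.04 | 1255.675 | 1086 | 1006 | 904 |
| 30 | 2017 | 3441 | 2463 | 2831 | 2632 | 2070 | 1939 |
| 50 | 3245 | 7273 | 4314.02 | 4565 | 4251 | 3221 | 2940 |
| 80 | 10218 | 20871 | 13407 | 13711 | 13779 | 10322 | 9494 |
| 100 | 15076 | 31061 | 19241 | 21190 | 19249 | 14806 | 14030 |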
Tab.3 Minimum of cost with R-M integrated strategies
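| Parameter | Description | Value |
| --- | --- | --- |
| $n$ | Number of jobs (scale) | 10, 20, 30, 50, 80, 100, 150 |
| $p_j$ | Processing time of job $j$ | Discrete U(1, 10) |
| $d_j$ | Due date of job $j$ | $p_j$ + Discrete U($n$, $3n$) |
| $\alpha$ | Cost coefficient of tardy delivery | 1 |
| $\beta$ | Cost coefficient of machine wear | 5 |
| $M_0$ | Initial machine state value | 100 |
| $t_{\mathrm{CM}}, t_{\mathrm{PM}}$ | Duration of CM and PM | 20, 10 |
| $C_{\mathrm{CM}}, C_{\mathrm{PM}}$ | Cost of CM and PM | 60, 40 |
| $C_{\mathrm{broke}}$ | Additional repair cost after a breakdown | 100 |
| $N_{\mathrm{Scale}}$ | Reward normalization scale | 100 |
| $h_1, h_2$ | Deterioration and failure effect points | 60, 0 |
| $\sigma$ | Deterioration effect factor | 0.05 |
| $M_1, M_2$ | Penalty base for out-of-bound states | 2, 2 |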
Tab.4 Environment parameter setting
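The job-generation rows of this table translate directly into an instance generator. The following minimal sketch uses Python's standard `random` module. `generate_instance` and `SCALES` are hypothetical names. Jobs are returned as (j, p_j, d_j) tuples. Arrival times are omitted because the table does not say how they are drawn.

```python
import random

# Instance scales listed in the environment parameter table.
SCALES = (10, 20, 30, 50, 80, 100, 150)


def generate_instance(n, seed=None):
    """Hypothetical generator: p_j ~ Discrete U(1, 10), d_j = p_j + Discrete U(n, 3n).

    Returns a list of (j, p_j, d_j) tuples; arrival times are not generated.
    """
    if n < 1:
        raise ValueError("n must be positive")
    rng = random.Random(seed)  # seeded generator for reproducible benchmark instances
    jobs = []
    for j in range(n):
        p = rng.randint(1, 10)          # randint includes both end points
        d = p + rng.randint(n, 3 * n)   # due-date slack grows with the scale n
        jobs.append((j, p, d))
    return jobs


if __name__ == "__main__":
    for n in SCALES:
        print(n, generate_instance(n, seed=0)[:3])
```

The due-date slack grows with n, so larger instances do not become uniformly tardy simply because more jobs queue up.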
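| Hyperparameter | DQN | PPO | A2C |
| --- | --- | --- | --- |
| Maximum training steps | 1.6×10^7 | 1.6×10^7 | 1.6×10^7 |
| Batch size | 256 | 256 | - |
| Environment interaction time steps | - | 2048 | 256 |
| Hidden layer nodes | 2×256 | 2×256 | 2×256 |
| Replay buffer size | 1×10^7 | - | - |
| Network synchronization frequency | 1×10^4 | - | - |
| Learning rate | 1×10^-4 | 1×10^-4 | 1×10^-4 |
| Discount factor γ | 0.99 | 0.99 | 0.99 |
| Initial exploration rate | 0.01 | - | 0.01 |
| λ (GAE) | - | 0.95 | 0.95 |
| Entropy regularization coefficient | - | 0.05 | 0.05 |
| Initial exploration rate | 1 | - | - |
| Exploration decay rate | 0.99 | - | - |
| Minimum exploration rate | 0.001 | - | - |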
Tab.5 Hyperparameter settings of DRL algorithms
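Two groups of these hyperparameters describe standard mechanisms. The following sketch covers both of them. The first part assumes the DQN exploration rate decays multiplicatively once per episode from 1 down to a floor of 0.001. The table gives these values but not the schedule. The second part is the standard generalized advantage estimation (GAE) recursion used by PPO and A2C, with γ = 0.99 and λ = 0.95. The function names are hypothetical, and this is not the authors' training code.

```python
def epsilon(episode, eps_start=1.0, decay=0.99, eps_min=0.001):
    """Assumed per-episode schedule: eps_k = max(eps_min, eps_start * decay**k)."""
    return max(eps_min, eps_start * decay ** episode)


def compute_gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Standard GAE over one rollout.

    dones[t] = 1 if the episode ended after step t; last_value bootstraps the
    step after the rollout. Returns (advantages, returns = advantages + values).
    """
    n = len(rewards)
    if not (len(values) == len(dones) == n):
        raise ValueError("rewards, values and dones must have equal length")
    advantages = [0.0] * n
    running = 0.0
    for t in reversed(range(n)):
        next_value = last_value if t == n - 1 else values[t + 1]
        not_done = 1.0 - float(dones[t])  # zero out bootstrapping across episode ends
        delta = rewards[t] + gamma * next_value * not_done - values[t]  # TD error
        running = delta + gamma * lam * not_done * running              # discounted sum of TD errors
        advantages[t] = running
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


if __name__ == "__main__":
    print([round(epsilon(k), 4) for k in (0, 100, 500, 688)])
    print(compute_gae([1.0, 0.0, 2.0], [0.5, 0.4, 0.3], last_value=0.0, dones=[0, 0, 1]))
```

Under this assumed schedule, ε first reaches its floor after 688 decays, since 0.99^688 is just below 0.001. With λ = 1 and γ = 1, GAE reduces to Monte Carlo returns minus the value baseline, which is a quick sanity check on the recursion.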
Fig.8 Influence curves of scale on average reward of episode
Fig.9 Influence curves of scale on average step of episode
Fig.10 Calculation speed curves with scale of 80
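| Scale | Algorithm | Time steps/10^6 | Mean FPS | Training time/h |
| --- | --- | --- | --- | --- |
| 10 | A2C | 1 | 1498.4 | 0.18 |
|  | DQN | 1 | 1660.8 | 0.27 |
|  | PPO | 1 | 1141.6 | 0.24 |
| 20 | A2C | 2 | 1473.7 | 0.37 |
|  | DQN | 2 | 1440.6 | 0.54 |
|  | PPO | 2 | 1109.6 | 0.50 |
| 30 | A2C | 2 | 1447.3 | 0.38 |
|  | DQN | 2 | 1393.9 | 0.54 |
|  | PPO | 2 | 1093.3 | 0.51 |
| 50 | A2C | 4 | 1400.9 | 0.79 |
|  | DQN | 4 | 1283.4 | 1.11 |
|  | PPO | 4 | 1065.4 | 1.05 |
| 80 | A2C | 8 | 1360.6 | 1.63 |
|  | DQN | 8 | 1154.3 | 2.27 |
|  | PPO | 8 | 1038.6 | 2.15 |
| 100 | A2C | 8 | 1350.4 | 1.66 |
|  | DQN | 8 | 1121.4 | 2.31 |
|  | PPO | 8 | 1018.6 | 2.19 |
| 150 | A2C | 16 | 1276.1 | 3.51 |
|  | DQN | 16 | 1051.9 | 4.68 |
|  | PPO | 16 | 1007.2 | 4.33 |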
Tab.6 Training performance comparison of DRL algorithms
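| Scale | Baseline Mean | Baseline Std | A2C Mean | A2C Std | DQN Mean | DQN Std | PPO Mean | PPO Std |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10 | 191.0 | 0.0 | 191.4 | 1.0 | 217.8 | 61.7 | 192.5 | 14.8 |
| 20 | 973.8 | 22.5 | 908.0 | 9.5 | 925.6 | 47.6 | 901.5 | 6.1 |
| 30 | 2103.9 | 102.9 | 1920.8 | 20.1 | 1948.3 | 109.0 | 1880.5 | 6.6 |
| 50 | 3310.6 | 172.1 | 3009.6 | 41.1 | 6177.7 | 2280.4 | 2936.7 | 19.0 |
| 80 | 10642.0 | 494.0 | 9712.7 | 65.1 | 10015.8 | 480.4 | 9469.7 | 40.3 |
| 100 | 15763.6 | 766.8 | 14551.0 | 172.2 | 14737.7 | 708.4 | 14020.2 | 119.5 |
| 150 | 30546.1 | 1166.5 | 27272.6 | 270.0 | 28197.4 | 1209.3 | 27073.3 | 275.1 |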
Tab.7 Optimized cost mean and standard deviation of DRL algorithms
Fig.11 Comparison of optimization cost results of different algorithms
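| Scale | R-M Min | A2C Min | DQN Min | PPO Min |
| --- | --- | --- | --- | --- |
| 10 | 191.0 | 191.0 | 191.0 | 191.0 |
| 20 | 904.0 | 895.0 | 901.0 | 900.0 |
| 30 | 1939.0 | 1878.0 | 1887.0 | 1878.0 |
| 50 | 2940.0 | 2948.0 | 3246.4 | 2911.6 |
| 80 | 9494.0 | 9550.3 | 9566.6 | 9397.0 |
| 100 | 14030.0 | 14210.0 | 13909.6 | 13818.0 |
| 150 | 27519.0 | 26743.8 | 26560.9 | 26600.9 |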
Tab.8 Optimized minimum cost of DRLs
| Scale | A2C Δmean/% | A2C Δmin/% | DQN Δmean/% | DQN Δmin/% | PPO Δmean/% | PPO Δmin/% |
| --- | --- | --- | --- | --- | --- | --- |
| 10 | -0.23 | 0.00 | -12.31 | 0.00 | -0.78 | 0.00 |
| 20 | 7.25 | 1.01 | 5.21 | 0.33 | 8.01 | 0.44 |
| 30 | 9.53 | 3.25 | 7.98 | 2.76 | 11.88 | 3.25 |
| 50 | 10.00 | -0.27 | -46.41 | -9.44 | 12.73 | 0.98 |
| 80 | 9.57 | -0.59 | 6.25 | -0.76 | 12.38 | 1.03 |
| 100 | 8.33 | -1.27 | 6.96 | 0.87 | 12.43 | 1.53 |
| 150 | 12.00 | 2.90 | 8.33 | 3.61 | 12.83 | 3.45 |
Tab.9 Cost optimization effect of DRL methods
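The percentages in this table are consistent with measuring improvement relative to the DRL cost: Δ = (C_R-M - C_DRL)/C_DRL × 100, where C_R-M is the best R-M baseline. This definition is inferred from the tabulated values rather than stated on this page. For example, A2C at scale 150 gives (30546.1 - 27272.6)/27272.6 ≈ 12.00%, and DQN at scale 50 gives (3310.6 - 6177.7)/6177.7 ≈ -46.41%. The function name below is hypothetical.

```python
def improvement_pct(rm_cost, drl_cost):
    """Inferred definition: cost reduction of a DRL method over the best R-M strategy,
    as a percentage of the DRL cost. Positive means the DRL method is cheaper."""
    if drl_cost == 0:
        raise ValueError("drl_cost must be non-zero")
    return (rm_cost - drl_cost) / drl_cost * 100.0


if __name__ == "__main__":
    # Mean costs at scale 150: R-M baseline 30546.1, A2C 27272.6, PPO 27073.3
    print(round(improvement_pct(30546.1, 27272.6), 2))  # 12.0
    print(round(improvement_pct(30546.1, 27073.3), 2))  # 12.83
```

Normalizing by the DRL cost rather than the baseline cost makes large losses (such as DQN at scale 50) show up as large negative percentages.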
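| Scale | Method | Min: median | Min: rank sum | Mean: median | Mean: rank sum |
| --- | --- | --- | --- | --- | --- |
| 7 | PPO | 2931.8 | 10 | 2957.3 | 10 |
| 7 | MDD-M | 2945.0 | 15 | 3363.9 | 24 |
| 7 | A2C | 3027.4 | 18 | 3058.8 | 14 |
| 7 | DQN | 3672.7 | 36 | 3819.0 | 31 |
| 7 | SPT-M | 3263.9 | 37 | 3633.0 | 36 |
| 7 | CR-M | 3258.3 | 41 | 3644.5 | 42 |
| 7 | EDD-M | 4258.6 | 45 | 4649.8 | 44 |
| 7 | MST-M | 4263.8 | 52 | 4837.5 | 52 |
| 7 | FCFS-M | 4534.5 | 61 | 5040.8 | 62 |

p1 = 0.00, p2 = 0.00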
Tab.10 Sorting results of Friedman test by different methods
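The reported p-values can be checked from the rank sums alone. This sketch makes several assumptions:

- The 7 instance scales act as blocks.
- k = 9 methods are ranked within each block without ties.
- The usual Friedman statistic χ²_F = 12/(nk(k+1)) ΣR_j² - 3n(k+1) is referred to a χ² distribution with k - 1 = 8 degrees of freedom.

The function names are hypothetical. Because the degrees of freedom are even, the χ² upper tail has a closed form, so no statistics library is needed.

```python
import math

# Rank sums from the Friedman test table, in the table's row order:
# PPO, MDD-M, A2C, DQN, SPT-M, CR-M, EDD-M, MST-M, FCFS-M
MIN_RANK_SUMS = [10, 15, 18, 36, 37, 41, 45, 52, 61]
MEAN_RANK_SUMS = [10, 24, 14, 31, 36, 42, 44, 52, 62]


def friedman_chi2(rank_sums, n_blocks):
    """Friedman statistic 12/(n k (k+1)) * sum(R_j^2) - 3 n (k+1), assuming no ties."""
    k = len(rank_sums)
    return 12.0 / (n_blocks * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n_blocks * (k + 1)


def chi2_sf_even_df(x, df):
    """P(X > x) for X ~ chi-square(df), using exp(-x/2) * sum_{i < df/2} (x/2)^i / i! (even df only)."""
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term = total = 1.0
    for i in range(1, df // 2):
        term *= half / i  # builds (x/2)^i / i! incrementally
        total += term
    return math.exp(-half) * total


if __name__ == "__main__":
    for label, sums in (("Min", MIN_RANK_SUMS), ("Mean", MEAN_RANK_SUMS)):
        stat = friedman_chi2(sums, n_blocks=7)
        print(label, round(stat, 2), chi2_sf_even_df(stat, df=len(sums) - 1))
```

With the Min rank sums, the sketch gives χ²_F ≈ 46.10. With the Mean rank sums, it gives χ²_F = 44.8. Both p-values are on the order of 10^-7, which rounds to the reported 0.00. As a consistency check, each column of rank sums totals 7 × 9 × 10 / 2 = 315, as it must when every block ranks all nine methods.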
Fig.12 Scheduling result and records with scale of 50