Cloud resource scheduling algorithm with failure recovery mechanism
QI Ping 1,2, LI Long-shu 1, LI Xue-jun 1
1. Department of Computer Science and Technology, Anhui University, Hefei 230039, China; 2. Department of Mathematics and Computer Science, Tongling University, Tongling 244000, China
A task scheduling model with a fault recovery mechanism was proposed to address the low reliability of Cloud services. The behavior of nodes was analyzed in terms of the failure recovery mechanism, and interaction failures between nodes were classified into two categories: unrecoverable failures and recoverable failures. A more practical Cloud service reliability model was then built by quantifying and evaluating the trustworthiness of computing nodes, drawing on social trust relationships. Resource owners can freely adjust the constraints on the number of recoveries performed and on the recoverability probability. A dynamic level scheduling (DLS) algorithm with fault recovery, named FR-DLS, was proposed by integrating node trustworthiness into the existing DLS algorithm. FR-DLS takes the trust degree of Cloud service resources into account when calculating the scheduling level of each task-resource pair, so that tasks are executed efficiently on trusted nodes. A CloudSim-based simulation platform in PlanetLab was developed to evaluate the proposed algorithm. Theoretical analysis and simulation results show that FR-DLS improves the task success rate in Cloud environments at the cost of a relatively small increase in execution time and schedule length. As the numbers of nodes and tasks grow, the gain in reliability is much larger than the added cost in execution time and schedule length, which supports the practicability of the algorithm in large-scale Cloud environments.
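The abstract does not give the FR-DLS formulas, so the following is only a minimal sketch of the general idea, written under stated assumptions rather than as the authors' method. It assumes that node trust is the expected value of a beta reputation estimate, that a recoverable failure triggers a retry up to an owner-set limit, and that the trust value simply multiplies a simplified dynamic level (static level minus the later of data-ready and node-ready time). The names `Node`, `beta_trust`, `success_with_recovery`, `trust_aware_level`, and `pick_node` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    successes: int        # observed successful interactions (illustrative history)
    failures: int         # observed failed interactions
    p_recoverable: float  # probability that a failure is recoverable (owner-set)
    max_recoveries: int   # maximum number of recovery attempts (owner-set)


def beta_trust(successes: int, failures: int) -> float:
    """Expected value of a Beta(successes + 1, failures + 1) estimate."""
    return (successes + 1) / (successes + failures + 2)


def success_with_recovery(node: Node) -> float:
    """Probability that a task eventually succeeds on the node.

    Assumed model: each attempt succeeds with probability `ok`; a failed
    attempt is recoverable with probability `p_recoverable` and is then
    retried, for at most `max_recoveries` retries.
    """
    ok = beta_trust(node.successes, node.failures)
    retry = (1.0 - ok) * node.p_recoverable
    return sum(ok * retry ** k for k in range(node.max_recoveries + 1))


def trust_aware_level(static_level: float, data_ready: float,
                      node_ready: float, node: Node) -> float:
    """Simplified dynamic level weighted by the node's success probability.

    The speed-difference term of the classic DLS level is omitted here.
    """
    dynamic_level = static_level - max(data_ready, node_ready)
    return success_with_recovery(node) * dynamic_level


def pick_node(static_level: float, ready_times: dict, nodes: list) -> Node:
    """Choose the node with the highest trust-aware level for one task."""
    return max(
        nodes,
        key=lambda n: trust_aware_level(static_level, *ready_times[n.name], n),
    )


fast = Node("fast", successes=3, failures=5, p_recoverable=0.5, max_recoveries=1)
reliable = Node("reliable", successes=8, failures=0, p_recoverable=0.5, max_recoveries=1)
# (data_ready, node_ready) per node; the numbers are made up for illustration.
ready = {"fast": (2.0, 3.0), "reliable": (4.0, 6.0)}
chosen = pick_node(20.0, ready, [fast, reliable])
```

In this toy setting the plain dynamic level would favor `fast` (17 against 14), while the trust-weighted level favors `reliable`. This mirrors the trade-off described in the abstract: a somewhat longer schedule in exchange for a higher chance of successful completion.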
QI Ping, LI Long-shu, LI Xue-jun. Cloud resource scheduling algorithm with failure recovery mechanism. Journal of Zhejiang University (Engineering Science), 2015, 49(12): 2305-2315.