Please wait a minute...
浙江大学学报(工学版)  2022, Vol. 56 Issue (2): 297-305    DOI: 10.3785/j.issn.1008-973X.2022.02.010
计算机与控制工程     
面向Flink流处理框架的主动备份容错优化
刘广轩1,2,3(),黄山1,2,3,*(),胡佳丽1,2,3,段晓东1,2,3
1. 大连民族大学 计算机科学与工程学院,辽宁 大连 116600
2. 大数据应用技术国家民委重点实验室,辽宁 大连 116600
3. 大连市民族文化数字技术重点实验室,辽宁 大连 116600
Fault tolerant optimization of active backup for Flink stream processing framework
Guang-xuan LIU1,2,3(),Shan HUANG1,2,3,*(),Jia-li HU1,2,3,Xiao-dong DUAN1,2,3
1. College of Computer Science and Technology, Dalian Minzu University, Dalian 116600, China
2. State Ethnic Affairs Commission Key Laboratory of Big Data Applied Technology, Dalian 116600, China
3. Dalian Key Laboratory of Digital Technology for National Culture , Dalian 116600, China
 全文: PDF(751 KB)   HTML
摘要:

针对Flink任务出现故障后因为全局卷回使流处理作业恢复效率低的问题,提出基于缓存队列的容错策略. 在作业中找出恢复时间最长的算子作为关键算子,将其处理过的数据存储到缓存队列中,并为其进行主动备份,备份算子同时接受来自上游的数据以达到在故障后作业可以瞬时恢复的效果. 为了解决主动备份带来的额外消耗,提出数据过滤算法,备份算子在每次处理数据前会到缓存组件中检索当前数据,以判断是否继续处理. 当Flink算子自身出现故障后,利用策略中的缓存队列与Flink的JobManager将故障发生时的数据信息发送给备份算子,在备份算子接收到数据后,实现即时恢复的效果. 利用4项评价指标对策略进行评估,结果表明,与Flink1.8的故障恢复模式相比,所提策略在Flink任务故障恢复速度上有显著提升,当故障次数分别为1、2、3、4时,恢复效率分别提高56.3%、51.3%、46.2%和45.8%;而在处理时延、CPU利用率以及内存使用率方面仅产生极小的代价.

关键词: Apache Flink流处理容错主动备份故障恢复缓存队列    
Abstract:

A fault-tolerant strategy based on cache queue was proposed, aiming at the problem of low efficiency of stream processing job recovery due to global rollback after Flink task fails. In the job, the operator with the longest recovery time is taken as the key operator, the processed data are stored in the buffer queue, and active backup is performed for it. The backup operator will also accept the data from the upstream to reach the the effect that the job can be restored instantaneously after a failure. In order to solve the additional consumption caused by active backup, a data filtering algorithm was proposed. The backup operator will retrieve the current data from the cache component before processing the data each time to determine whether to continue processing. When the Flink operator itself fails, it will use the buffer queue in the strategy and Flink’s JobManager to send the data information at the time of the failure to the backup operator. When the backup operator receives the data, it will realize the effect of instant recovery. The strategy was evaluated on four evaluation indicators. Compared with the failure recovery mode of Flink1.8, the proposed strategy had a significant improvement in Flink task failure recovery. The recovery efficiency was increased by 56.3%, 51.3%, 46.2% and 45.8% under failure times of 1, 2, 3 and 4 separately. And the proposed strategy brings only a very small price in terms of processing delay, CPU utilization and memory usage.

Key words: Apache Flink    stream processing fault tolerance    active backup    failure recovery    buffer queue
收稿日期: 2021-07-18 出版日期: 2022-03-03
CLC:  TP 316.4  
基金资助: 国家重点研发计划云计算和大数据重点专项(2018YFB1004402)
通讯作者: 黄山     E-mail: liuguangxuan_ustl@163.com;huangshan@dlnu.edu.cn
作者简介: 刘广轩(1997—),男,硕士生,从事大数据和流计算研究. orcid.org/0000-0003-3773-7594. E-mail: liuguangxuan_ustl@163.com
服务  
把本文推荐给朋友
加入引用管理器
E-mail Alert
作者相关文章  
刘广轩
黄山
胡佳丽
段晓东

引用本文:

刘广轩,黄山,胡佳丽,段晓东. 面向Flink流处理框架的主动备份容错优化[J]. 浙江大学学报(工学版), 2022, 56(2): 297-305.

Guang-xuan LIU,Shan HUANG,Jia-li HU,Xiao-dong DUAN. Fault tolerant optimization of active backup for Flink stream processing framework. Journal of ZheJiang University (Engineering Science), 2022, 56(2): 297-305.

链接本文:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2022.02.010        https://www.zjujournals.com/eng/CN/Y2022/V56/I2/297

图 1  Flink Job的StreamGraph
日期 开始时间 故障次数 完成时间 完成
1.7 8:00 0 8:08:20
1.8 8:00 1 8:09:10
1.9 8:00 2 8:10:02
1.10 8:00 3 8:10:23
1.11 8:00 1 8:09:08
表 1  Flink任务日志信息
P K1 K2 T
8 8 0 2000
12 8 4 1923
16 16 0 4335
表 2  增加并行度对吞吐量的影响
图 2  缓存队列容错策略工作流程
图 3  正常情况下缓存队列的拦截作用
图 4  备份算子处理数据示意图
图 5  算子自身故障恢复过程
硬件 配置信息
处理器 Intel i7-8700 CPU @ 3.20 GHz
CPU核心数 6核
主频 3.2 GHz
内存 64 GB
硬盘 100 GB SSD
表 3  Flink集群硬件配置信息
软件名 版本号 软件名 版本号
Centos OS 7.4.1 Java 1.8.0
Flink 1.8.0 Kafka 0.10.1.0
Hadoop 2.7.5 Flume 1.7.0
表 4  Flink集群软件配置信息
算子 TR/s 算子 TR/s
Source 45 Filter 67
Connet 65 Map 53
FlatMap 71 Sink 39
表 5  Flink任务故障平均恢复时间
数据量/万 处理时延/s
Flink-RestartAll Flink-BQBS
10 82.0 84.5
50 170.0 176.0
100 355.0 369.0
200 723.0 739.0
表 6  不同数据量下平均处理延迟对比
故障次数 Flink-RestartAll累计
重启时间/s
Flink BQBS故障
恢复时间/s
1 71 31
2 150 73
3 210 113
4 290 157
表 7  不同恢复方式下平均恢复时间对比
对比算法 恢复模式 TR/s ηn/% ηf/% δ/%
Flink-RestartAll 全局卷回 71 62.0 76 40.01
Flink-BQBS 单点恢复 31 62.5 82 42.70
表 8  Flink-RestartAll与Flink-BQBS对比信息
1 DEAN J, GHEMAWAT S MapReduce: a flexible data processing tool[J]. Communications of the ACM, 2010, 53 (1): 72- 77
doi: 10.1145/1629175.1629198
2 ZAHARIA M, CHOWDHURY M, FRANKLIN M J, et al Spark: cluster computing with working sets[J]. HotCloud, 2010, 10: 95
3 KATSIFODIMOS A, SCHELTER S. Apache Flink: stream analytics at scale [C]// IEEE International Conference on Cloud Engineering Workshop. Berlin: IC2EW, 2016: 28-38.
4 CHINTAPALLI S, DAGIT D, EVANS B, et al. Benchmarking streaming computation engines: storm, flink and spark streaming [C]// 2016 IEEE International Parallel and Distributed Processing Symposium Worshops. Chicago: IPDPSW, 2016: 1789-1792.
5 CHANDY K M, LAMPORT L Distributed snapshots: determining global states of a distributed system[J]. ACM Transactions on Computer Systems, 2016, 3 (1): 63- 75
6 BALAZINSKA M, BALAKRISHNAN H, MADDEN S R, et al Fault-tolerance in the Borealis distributed stream processing system[J]. ACM Transactions on Database Systems, 2008, 33 (1): 81- 124
7 HWANG J H, CETINTEMEL U, ZDONIK S. Fast and highly-available stream processing over wide area networks [C]// 2008 IEEE 24th International Conference on Data Engineering. Cancun: ICDE, 2008: 804-813.
8 HEINZE T, ZIA M, KRAHN R, et al. An adaptive replication scheme for elastic data stream processing systems [C]// Proceedings of the 9th ACM International Conference on Distributed Event-based Systems. Oslo: DEBS, 2015 : 150-161.
9 CHANDRAMOULI B, GOLDSTEIN J Shrink: prescribing resiliency solutions for streaming[J]. Proceedings of the VLDB Endowment, 2017, 10 (5): 505- 516
10 BENOIT A, RAINA S K, ROBERT Y Efficient checkpoint/verification patterns[J]. International Journal of High Performance Computing Applications, 2017, 31 (1): 52- 65
doi: 10.1177/1094342015594531
11 AKBER S M A , CHEN H , WANG Y , et al. Minimizing overheads of checkpoints in distributed stream processing systems [C]// 2018 IEEE 7th International Conference on Cloud Networking. Tokyo: CloudNet, 2018: 1-4.
12 ZHUANG Y, WEI X, LI H, et al. An optimal checkpointing model with online OCI adjustment for stream processing applications [C]// International Conference on Computer Communication and Networks. Hangzhou: ICCCN, 2018: 1-9.
13 LOMBARDI F, ANIELLO L, BONOMI S, et al Elastic symbiotic scaling of operators and resources in stream processing systems[J]. IEEE Transactions on Parallel and Distributed Systems, 2018, 29 (99): 572- 585
14 刘智亮. 面向流数据处理的动态自适应检查点机制研究[D]. 吉林: 吉林大学, 2017.
LIU Zhi-liang . Research on adaptive checkpoint mechanism for large-scale streaming data processing [D]. Jilin: Jilin University, 2017.
15 郭文鹏, 赵宇海, 王国仁, 等 面向Flink迭代计算的高效容错处理技术[J]. 计算机学报, 2020, 43 (11): 2101- 2118
GUO Wen-peng, ZHAO Yu-hai, WANG Guo-ren, et al Efficient fault-tolerant processing technology for Flink iterative computing[J]. Chinese Journal of Computers, 2020, 43 (11): 2101- 2118
doi: 10.11897/SP.J.1016.2020.02101
16 HAN D Z, CHEN X G, LEI Y X, et al Real-time data analysis system based on Spark Streaming and its application[J]. Journal of Computer Applications, 2017, 37 (5): 1263- 1269
[1] 吕娜,刘创,陈柯帆,曹芳波. 考虑控制器故障的软件定义机载网络选举算法[J]. 浙江大学学报(工学版), 2019, 53(4): 785-793.