Please wait a minute...
浙江大学学报(工学版)  2023, Vol. 57 Issue (2): 320-329    DOI: 10.3785/j.issn.1008-973X.2023.02.012
计算机技术     
支持日志乱序提交的分布式一致性协议
王进(),李博涵*(),吴佳骏,宋欣洋
南京航空航天大学 计算机科学与技术学院,江苏 南京 211106
Distributed consistency protocol supporting log submission out-of-order
Jin WANG(),Bo-han LI*(),Jia-jun WU,Xin-yang SONG
College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
 全文: PDF(1551 KB)   HTML
摘要:

为了解决Raft算法中日志提交的严格串行化设计,提出一种Raft协议变体:并行提交Raft (PC-Raft). PC-Raft在日志提交阶段运用流水线,重新设计日志缓冲区,实现日志的乱序提交. 传输方式使用RDMA网络,在提高日志传输的并发性的同时加快传输速度. 在日志执行阶段,采用批处理,将多条指令打包发送给状态机逐条执行. 针对日志并行提交情况下会出现的幽灵复现问题,采用LSN与任期号结合的方式保证日志的安全性. 针对日志乱序提交会出现的日志空洞问题,重新设计领导者选举算法,在选举中加入临时领导者,保证选举出的领导者能最快恢复日志. 测试结果证明PC-Raft对比Raft在吞吐量方面有着明显的性能提升,同时延迟更低,并且在日志指令依赖频繁的情况下,吞吐量比现有基于Raft的变体更高,延迟也更低.

关键词: 一致性协议PaxosRaftPC-Raft乱序提交    
Abstract:

A Raft protocol variant, parallel commit Raft (PC-Raft), was proposed, in order to solve the strict serialization design of log submission in Raft algorithm. In PC-Raft, pipeline was used in the log submission stage and the log buffer was redesigned to realize out-of-order submission of logs. The RDMA network was used as the transmission method to improve the concurrency of log transmission and speed up the transmission. In the log execution stage, batch processing was used to package multiple instructions and send them to the state machine for execution one by one. For the ghost recurrence problem that occurs, the combination of LSN and term number was used to ensure the security of the log. In order to solve the problem of log holes caused by disordered log submission, the leader election algorithm was redesigned, and temporary leaders were added to the election to ensure that the elected leaders can recover the logs as quickly as possible. Test results show that PC-Raft has a significant performance improvement in throughput compared with Raft, and the delay is lower. In the case of frequent log instruction dependency, the throughput is higher and the latency is lower compared with the existing Raft based variant.

Key words: consistency protocol    Paxos    Raft    PC-Raft    out-of-order submission
收稿日期: 2022-07-31 出版日期: 2023-02-28
CLC:  TP 311  
基金资助: CCF-华为数据库系统创新研究计划资助项目(CCF HUAWEIDBIR2020001A)
通讯作者: 李博涵     E-mail: 1598076545@qq.com;bhli@nuaa.edu.cn
作者简介: 王进(1997—),男,硕士生,从事数据库一致性协议研究. orcid.org/0000-0001-5333-0271. E-mail: 1598076545@qq.com
服务  
把本文推荐给朋友
加入引用管理器
E-mail Alert
作者相关文章  
王进
李博涵
吴佳骏
宋欣洋

引用本文:

王进,李博涵,吴佳骏,宋欣洋. 支持日志乱序提交的分布式一致性协议[J]. 浙江大学学报(工学版), 2023, 57(2): 320-329.

Jin WANG,Bo-han LI,Jia-jun WU,Xin-yang SONG. Distributed consistency protocol supporting log submission out-of-order. Journal of ZheJiang University (Engineering Science), 2023, 57(2): 320-329.

链接本文:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2023.02.012        https://www.zjujournals.com/eng/CN/Y2023/V57/I2/320

图 1  Raft协议日志复制过程
图 2  流水线并行提交日志
图 3  批处理打包执行
轮次 Leader A B C
Round 1 A 1~8 1~3 1~3
Round 2 B Down 1~5,10 1~5,10
Round 3 C 1~10 1~10 1~10
表 1  幽灵复现示例
图 4  Raft协议的栏杆设置
图 5  日志缓冲区结构设计
图 6  PC-Raft的日志复制过程
图 7  节点状态变迁
图 8  领导者竞选流程
图 9  临时Leader的日志恢复流程
图 10  不同客户端数量下的吞吐量
图 11  客户端数量对延迟的影响
图 12  负载倾斜下的吞吐量
图 13  数据倾斜对吞吐率的影响
图 14  副本个数对吞吐量的影响
图 15  副本个数对延迟的影响
1 KEMME B, ALONSO G. Don't be lazy, be consistent: Postgres-R, a new way to implement database replication [C]// VLDB. Portugal: [s.n.], 2000: 134-143.
2 WIESMANN M, PEDONE F, SCHIPER A, et al. Database replication techniques: a three parameter classification [C]// Proceedings of 19th IEEE Symposium on Reliable Distributed Systems SRDS-2000. [s.l.]: IEEE, 2000: 206-215.
3 STONEBRAKER M Concurrency control and consistency of multiple copies of data in distributed INGRES[J]. IEEE Transactions on Software Engineering, 1979, 5 (3): 188- 194
4 SCHNEIDER F B Implementing fault-tolerant services using the state machine approach: a tutorial[J]. ACM Computing Surveys (CSUR), 1990, 22 (4): 299- 319
doi: 10.1145/98163.98167
5 LAMPORT L. The part-time parliament [M]// Concurrency: the Works of Leslie Lamport, 2019: 277-317.
6 KIRSCH J, AMIR Y. Paxos for system builders: an overview [C]// Proceedings of the 2nd Workshop on Large-Scale Distributed Systems and Middleware. Portugal: [s.n.], 2008: 1-6.
7 CHARAPKO A, AILIJIANG A, DEMIRBAS M. Linearizable quorum reads in Paxos [C]// 11th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 19). Portugal: [s.n.], 2019.
8 ZHENG J, LIN Q, XU J, et al PaxosStore: high-availability storage made practical in WeChat[J]. Proceedings of the VLDB Endowment, 2017, 10 (12): 1730- 1741
doi: 10.14778/3137765.3137778
9 ONGARO D, OUSTERHOUT J. In search of an understandable consensus algorithm [C]// 2014 USENIX Annual Technical Conference (Usenix ATC 14). Portugal: [s. n. ], 2014: 305-319.
10 GRAY J N. Notes on data base operating systems [M]// Operating systems. Berlin, Heidelberg: Springer, 1978: 393-481.
11 SKEEN D. Nonblocking commit protocols [C]// Proceedings of the 1981 ACM SIGMOD International Conference on Management of Data. Portugal: [s.n.], 1981: 133-142.
12 GIFFORD D K. Weighted voting for replicated data [C]// Proceedings of the 7th ACM Symposium on Operating Systems Principles. Portugal: [s.n.], 1979: 150-162.
13 LAMPORT L Fast paxos[J]. Distributed Computing, 2006, 19 (2): 79- 103
doi: 10.1007/s00446-006-0005-x
14 CHANDRA T D, GRIESEMER R, REDSTONE J. Paxos made live: an engineering perspective [C]// Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing. Portugal: [s.n.], 2007: 398-407.
15 HUNT P, KONAR M, JUNQUEIRA F P, et al. ZooKeeper: wait-free coordination for internet-scale systems [C]// 2010 USENIX Annual Technical Conference (USENIX ATC 10). Portugal: [s.n.], 2010.
16 MORARU I, ANDERSEN D G, KAMINSKY M. There is more consensus in egalitarian parliaments [C]// Proceedings of the 24th ACM Symposium on Operating Systems Principles. Portugal: [s.n.], 2013: 358-372.
17 POKE M, HOEFLER T. Dare: high-performance state machine replication on RDMA networks [C]// Proceedings of the International Symposium on High-Performance Parallel and Distributed Computing (HPDC). Portugal: [s.n.], 2015: 107-118.
18 MITCHELL C, GENG Y, LI J. Using one-sided RDMA reads to build a fast, CPU-Efficient key-value store [C]// Proceedings of the 2013 USENIX Annual Technical Conference. San Jose: [s.n.], 2013: 103–114.
19 ZUO P, SUN J, YANG L, et al. One-sided {RDMA-Conscious} extendible hashing for disaggregated memory [C]// 2021 USENIX Annual Technical Conference (USENIX ATC 21). Portugal: [s.n.], 2021: 15-29.
20 GUO C, WU H, DENG Z, et al. RDMA over commodity ethernet at scale [C]// Proceedings of the 2016 ACM SIGCOMM Conference. Portugal: [s.n.], 2016: 202-215.
21 GAO Y, YANG Y, CHEN T, et al. DCQCN+: taming large-scale incast congestion in RDMA over ethernet networks [C]// 2018 IEEE 26th International Conference on Network Protocols (ICNP). [s.l.]: IEEE, 2018: 110-120.
22 YI B, XIA J, CHEN L, et al. Towards zero copy dataflows using RDMA [EB/OL]. [2022-06-01]. http://dl.acm.org/doi/abs/10.1145/3123878.3131975.
23 FENT P, VAN RENEN A, KIPF A, et al. Low-latency communication for fast DBMS using RDMA and shared memory [C]// 2020 IEEE 36th International Conference on Data Engineering (ICDE). [s.l.]: IEEE, 2020: 1477-1488.
24 WANG Y, LIU K, TIAN C, et al. Error recovery of RDMA packets in data center networks [C]// 28th International Conference on Computer Communication and Networks (ICCCN). [s.l.]: IEEE, 2019: 1-8.
25 ZAMANIAN E, YU X, STONEBRAKER M, et al Rethinking database high availability with RDMA networks[J]. Proceedings of the VLDB Endowment, 2019, 12 (11): 1637- 1650
doi: 10.14778/3342263.3342639
26 WU H, CHEN K, WU Y W, ZHENG W M Research on the consensus of big data systems based on RDMA and NVM[J]. Big Data Research, 2019, 5 (4): 89- 99
27 ANDERSON T E, CANINI M, KIM J, et al. Assise: performance and availability via client-local {NVM} in a distributed file system [C]// 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). Portugal: [s. n. ], 2020: 1011-1027.
28 CAPPELLETTI P, ANNUNZIATA R, ARNAUD F, et al Phase change memory for automotive grade embedded NVM applications[J]. Journal of Physics D: Applied Physics, 2020, 53 (19): 193002
doi: 10.1088/1361-6463/ab71aa
29 ZUO P, SUN J, YANG L, et al. One-sided RDMA-conscious extendible hashing for disaggregated memory [EB/OL]. [2022-06-01]. https://www.ccf.org.cn/ccfdl/ccf_dl_focus/computer-architecture/volume4/zllb1/2022-07-26/766849.shtml.
30 CHEN A A review of emerging non-volatile memory (NVM) technologies and applications[J]. Solid-State Electronics, 2016, 125 (11): 25- 38
31 CAO W, LIU Z, WANG P, et al PolarFS: an ultra-low latency and failure resilient distributed file system for shared storage cloud database[J]. Proceedings of the VLDB Endowment, 2018, 11 (12): 1849- 1862
doi: 10.14778/3229863.3229872
32 ZHANG Z, HU H, YU Y, et al. Dependency preserved raft for transactions [C]// International Conference on Database Systems for Advanced Applications. Cham: Springer, 2020: 228-245.
[1] 修思文, 黄凯, 余慜, 谢天艺, 葛海通, 严晓浪. 面向非写分配高速缓存的一致性协议及实现[J]. 浙江大学学报(工学版), 2015, 49(2): 351-359.
[2] 修思文, 黄凯, 余慜, 谢天艺, 葛海通, 严晓浪. 面向非写分配高速缓存的一致性协议及实现[J]. 浙江大学学报(工学版), 2014, 48(9): 1-9.