Please wait a minute...
Journal of ZheJiang University (Engineering Science)  2023, Vol. 57 Issue (2): 320-329    DOI: 10.3785/j.issn.1008-973X.2023.02.012
    
Distributed consistency protocol supporting log submission out-of-order
Jin WANG(),Bo-han LI*(),Jia-jun WU,Xin-yang SONG
College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
Download: HTML     PDF(1551KB) HTML
Export: BibTeX | EndNote (RIS)      

Abstract  

A Raft protocol variant, parallel commit Raft (PC-Raft), was proposed, in order to solve the strict serialization design of log submission in Raft algorithm. In PC-Raft, pipeline was used in the log submission stage and the log buffer was redesigned to realize out-of-order submission of logs. The RDMA network was used as the transmission method to improve the concurrency of log transmission and speed up the transmission. In the log execution stage, batch processing was used to package multiple instructions and send them to the state machine for execution one by one. For the ghost recurrence problem that occurs, the combination of LSN and term number was used to ensure the security of the log. In order to solve the problem of log holes caused by disordered log submission, the leader election algorithm was redesigned, and temporary leaders were added to the election to ensure that the elected leaders can recover the logs as quickly as possible. Test results show that PC-Raft has a significant performance improvement in throughput compared with Raft, and the delay is lower. In the case of frequent log instruction dependency, the throughput is higher and the latency is lower compared with the existing Raft based variant.



Key wordsconsistency protocol      Paxos      Raft      PC-Raft      out-of-order submission     
Received: 31 July 2022      Published: 28 February 2023
CLC:  TP 311  
Fund:  CCF-华为数据库系统创新研究计划资助项目(CCF HUAWEIDBIR2020001A)
Corresponding Authors: Bo-han LI     E-mail: 1598076545@qq.com;bhli@nuaa.edu.cn
Cite this article:

Jin WANG,Bo-han LI,Jia-jun WU,Xin-yang SONG. Distributed consistency protocol supporting log submission out-of-order. Journal of ZheJiang University (Engineering Science), 2023, 57(2): 320-329.

URL:

https://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2023.02.012     OR     https://www.zjujournals.com/eng/Y2023/V57/I2/320


支持日志乱序提交的分布式一致性协议

为了解决Raft算法中日志提交的严格串行化设计,提出一种Raft协议变体:并行提交Raft (PC-Raft). PC-Raft在日志提交阶段运用流水线,重新设计日志缓冲区,实现日志的乱序提交. 传输方式使用RDMA网络,在提高日志传输的并发性的同时加快传输速度. 在日志执行阶段,采用批处理,将多条指令打包发送给状态机逐条执行. 针对日志并行提交情况下会出现的幽灵复现问题,采用LSN与任期号结合的方式保证日志的安全性. 针对日志乱序提交会出现的日志空洞问题,重新设计领导者选举算法,在选举中加入临时领导者,保证选举出的领导者能最快恢复日志. 测试结果证明PC-Raft对比Raft在吞吐量方面有着明显的性能提升,同时延迟更低,并且在日志指令依赖频繁的情况下,吞吐量比现有基于Raft的变体更高,延迟也更低.


关键词: 一致性协议,  Paxos,  Raft,  PC-Raft,  乱序提交 
Fig.1 Log replication process of Raft protocol
Fig.2 Pipeline parallel submission log
Fig.3 Batch packaging execution
轮次 Leader A B C
Round 1 A 1~8 1~3 1~3
Round 2 B Down 1~5,10 1~5,10
Round 3 C 1~10 1~10 1~10
Tab.1 Example of ghost reappearance
Fig.4 Railing settings for Raft protocol
Fig.5 Design of log buffer structure
Fig.6 Process of log copy of PC-Raft
Fig.7 Transition of node state
Fig.8 Process of Leader election
Fig.9 Log recovery process of temporary Leader
Fig.10 Throughput under different numbers of clients
Fig.11 Impact of number of clients on latency
Fig.12 Throughput under load skew
Fig.13 Impact of data skew on throughput
Fig.14 Impact of replicas number on throughput
Fig.15 Impact of replicas number on latency
[1]   KEMME B, ALONSO G. Don't be lazy, be consistent: Postgres-R, a new way to implement database replication [C]// VLDB. Portugal: [s.n.], 2000: 134-143.
[2]   WIESMANN M, PEDONE F, SCHIPER A, et al. Database replication techniques: a three parameter classification [C]// Proceedings of 19th IEEE Symposium on Reliable Distributed Systems SRDS-2000. [s.l.]: IEEE, 2000: 206-215.
[3]   STONEBRAKER M Concurrency control and consistency of multiple copies of data in distributed INGRES[J]. IEEE Transactions on Software Engineering, 1979, 5 (3): 188- 194
[4]   SCHNEIDER F B Implementing fault-tolerant services using the state machine approach: a tutorial[J]. ACM Computing Surveys (CSUR), 1990, 22 (4): 299- 319
doi: 10.1145/98163.98167
[5]   LAMPORT L. The part-time parliament [M]// Concurrency: the Works of Leslie Lamport, 2019: 277-317.
[6]   KIRSCH J, AMIR Y. Paxos for system builders: an overview [C]// Proceedings of the 2nd Workshop on Large-Scale Distributed Systems and Middleware. Portugal: [s.n.], 2008: 1-6.
[7]   CHARAPKO A, AILIJIANG A, DEMIRBAS M. Linearizable quorum reads in Paxos [C]// 11th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 19). Portugal: [s.n.], 2019.
[8]   ZHENG J, LIN Q, XU J, et al PaxosStore: high-availability storage made practical in WeChat[J]. Proceedings of the VLDB Endowment, 2017, 10 (12): 1730- 1741
doi: 10.14778/3137765.3137778
[9]   ONGARO D, OUSTERHOUT J. In search of an understandable consensus algorithm [C]// 2014 USENIX Annual Technical Conference (Usenix ATC 14). Portugal: [s. n. ], 2014: 305-319.
[10]   GRAY J N. Notes on data base operating systems [M]// Operating systems. Berlin, Heidelberg: Springer, 1978: 393-481.
[11]   SKEEN D. Nonblocking commit protocols [C]// Proceedings of the 1981 ACM SIGMOD International Conference on Management of Data. Portugal: [s.n.], 1981: 133-142.
[12]   GIFFORD D K. Weighted voting for replicated data [C]// Proceedings of the 7th ACM Symposium on Operating Systems Principles. Portugal: [s.n.], 1979: 150-162.
[13]   LAMPORT L Fast paxos[J]. Distributed Computing, 2006, 19 (2): 79- 103
doi: 10.1007/s00446-006-0005-x
[14]   CHANDRA T D, GRIESEMER R, REDSTONE J. Paxos made live: an engineering perspective [C]// Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing. Portugal: [s.n.], 2007: 398-407.
[15]   HUNT P, KONAR M, JUNQUEIRA F P, et al. ZooKeeper: wait-free coordination for internet-scale systems [C]// 2010 USENIX Annual Technical Conference (USENIX ATC 10). Portugal: [s.n.], 2010.
[16]   MORARU I, ANDERSEN D G, KAMINSKY M. There is more consensus in egalitarian parliaments [C]// Proceedings of the 24th ACM Symposium on Operating Systems Principles. Portugal: [s.n.], 2013: 358-372.
[17]   POKE M, HOEFLER T. Dare: high-performance state machine replication on RDMA networks [C]// Proceedings of the International Symposium on High-Performance Parallel and Distributed Computing (HPDC). Portugal: [s.n.], 2015: 107-118.
[18]   MITCHELL C, GENG Y, LI J. Using one-sided RDMA reads to build a fast, CPU-Efficient key-value store [C]// Proceedings of the 2013 USENIX Annual Technical Conference. San Jose: [s.n.], 2013: 103–114.
[19]   ZUO P, SUN J, YANG L, et al. One-sided {RDMA-Conscious} extendible hashing for disaggregated memory [C]// 2021 USENIX Annual Technical Conference (USENIX ATC 21). Portugal: [s.n.], 2021: 15-29.
[20]   GUO C, WU H, DENG Z, et al. RDMA over commodity ethernet at scale [C]// Proceedings of the 2016 ACM SIGCOMM Conference. Portugal: [s.n.], 2016: 202-215.
[21]   GAO Y, YANG Y, CHEN T, et al. DCQCN+: taming large-scale incast congestion in RDMA over ethernet networks [C]// 2018 IEEE 26th International Conference on Network Protocols (ICNP). [s.l.]: IEEE, 2018: 110-120.
[22]   YI B, XIA J, CHEN L, et al. Towards zero copy dataflows using RDMA [EB/OL]. [2022-06-01]. http://dl.acm.org/doi/abs/10.1145/3123878.3131975.
[23]   FENT P, VAN RENEN A, KIPF A, et al. Low-latency communication for fast DBMS using RDMA and shared memory [C]// 2020 IEEE 36th International Conference on Data Engineering (ICDE). [s.l.]: IEEE, 2020: 1477-1488.
[24]   WANG Y, LIU K, TIAN C, et al. Error recovery of RDMA packets in data center networks [C]// 28th International Conference on Computer Communication and Networks (ICCCN). [s.l.]: IEEE, 2019: 1-8.
[25]   ZAMANIAN E, YU X, STONEBRAKER M, et al Rethinking database high availability with RDMA networks[J]. Proceedings of the VLDB Endowment, 2019, 12 (11): 1637- 1650
doi: 10.14778/3342263.3342639
[26]   WU H, CHEN K, WU Y W, ZHENG W M Research on the consensus of big data systems based on RDMA and NVM[J]. Big Data Research, 2019, 5 (4): 89- 99
[27]   ANDERSON T E, CANINI M, KIM J, et al. Assise: performance and availability via client-local {NVM} in a distributed file system [C]// 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). Portugal: [s. n. ], 2020: 1011-1027.
[28]   CAPPELLETTI P, ANNUNZIATA R, ARNAUD F, et al Phase change memory for automotive grade embedded NVM applications[J]. Journal of Physics D: Applied Physics, 2020, 53 (19): 193002
doi: 10.1088/1361-6463/ab71aa
[29]   ZUO P, SUN J, YANG L, et al. One-sided RDMA-conscious extendible hashing for disaggregated memory [EB/OL]. [2022-06-01]. https://www.ccf.org.cn/ccfdl/ccf_dl_focus/computer-architecture/volume4/zllb1/2022-07-26/766849.shtml.
[30]   CHEN A A review of emerging non-volatile memory (NVM) technologies and applications[J]. Solid-State Electronics, 2016, 125 (11): 25- 38
[31]   CAO W, LIU Z, WANG P, et al PolarFS: an ultra-low latency and failure resilient distributed file system for shared storage cloud database[J]. Proceedings of the VLDB Endowment, 2018, 11 (12): 1849- 1862
doi: 10.14778/3229863.3229872
[32]   ZHANG Z, HU H, YU Y, et al. Dependency preserved raft for transactions [C]// International Conference on Database Systems for Advanced Applications. Cham: Springer, 2020: 228-245.
[1] Yan-hao CHEN,Sui-huai YU,Jian-jie CHU,Wen-zhe CUN. Evaluation of aircraft cockpit form based on hesitant fuzzy sets[J]. Journal of ZheJiang University (Engineering Science), 2022, 56(8): 1568-1577.
[2] Ke-wen ZHANG,Bai-song PAN. Control design of spacecraft autonomous rendezvous using nonlinear models with uncertainty[J]. Journal of ZheJiang University (Engineering Science), 2022, 56(4): 833-842.
[3] Shou-guo ZHENG,Yong-de ZHANG,Wen-tian XIE,Hu FAN,Qing WANG. Aircraft final assembly line modeling based on digital twin[J]. Journal of ZheJiang University (Engineering Science), 2021, 55(5): 843-854.
[4] Jie LIU,Xian-zhou DONG,Wei HAN,Xin-wei WANG,Chun LIU,Jun JIA. Trajectory planning for carrier aircraft on deck using Newton Symplectic pseudo-spectral method[J]. Journal of ZheJiang University (Engineering Science), 2020, 54(9): 1827-1838.
[5] HUANG Zi-liang, OUYANG Xiao-ping, ZHAO Tian-fei, ZHANG Jian-bo, ZHOU Liang, YANG Hua-yong. Characteristic of air content detection system for aircraft hydraulics[J]. Journal of ZheJiang University (Engineering Science), 2019, 53(1): 158-165.
[6] ZHENG Shou-guo, CUI Yan-min, WANG Qing, YANG Fei, CHENG Liang. Design of field data acquisition platform for aircraft assembly[J]. Journal of ZheJiang University (Engineering Science), 2018, 52(8): 1526-1534.
[7] ZHOU Min, ZHENG Guo-lei, ZHENG Zu-jie. Automatic recognition method based on fuzzy inference for planar-top rib in aircraft structural parts[J]. Journal of ZheJiang University (Engineering Science), 2018, 52(3): 591-598.
[8] OUYANG Xiao-ping, LIU Yu-long, XUE Zhi-quan, GUO Sheng-rong, ZHOU Qing-he, YANG Hua-yong. null[J]. Journal of ZheJiang University (Engineering Science), 2017, 51(7): 1361-1367.
[9] OUYANG Xiao-ping, ZHAO Tian-fei, LI Feng, YANG Shang-bao, ZHU Ying, YANG Hua-yong. Integral variable PI control on flow load simulator of aircraft hydraulic system[J]. Journal of ZheJiang University (Engineering Science), 2017, 51(6): 1111-1118.
[10] WANG Qing, YU Xiao guang, Qiao Ming jie, ZHAO An an, CHENG Liang, LI Jiang xiong, KE Ying lin. Rapid calibration based on SQP algorithm for coordinate frame of localizers[J]. Journal of ZheJiang University (Engineering Science), 2017, 51(2): 319-327.
[11] WANG Qing, WEN Li-qing, LI Jiang-xiong, KE Ying-lin, LI Tao, ZHANG Shi-jiong. Modeling and optimization for aircraft final assembly line based on Petri net[J]. Journal of ZheJiang University (Engineering Science), 2015, 49(7): 1224-1231.
[12] DOU Ya-dong, WANG Qing, LI Jiang-xiong, KE Ying-lin. Data integration for aircraft digital assembly system[J]. Journal of ZheJiang University (Engineering Science), 2015, 49(5): 858-865.
[13] WANG Qing, LIANG Qin, LI Jiang-xiong, KE Ying-lin. Estimation and alignment method of wing position and orientation for aircraft digital assembly[J]. Journal of ZheJiang University (Engineering Science), 2014, 48(7): 1287-1294.
[14] WEI Tao, HU Hai-bing, LU Dao-rong, WU Hong-fei.
Non-isolated DC/DC converters capable of continuous input and output current flows
[J]. Journal of ZheJiang University (Engineering Science), 2014, 48(10): 1901-1908.
[15] CAI Yuan-qiang,LIU Xin-feng,GUO Lin,SUN Hong-lei,CAO Zhi-gang. Long-term settlement of surcharge preloading foundation in soft clay area induced by aircraft loads[J]. Journal of ZheJiang University (Engineering Science), 2013, 47(7): 1157-1163.