|
|
Distributed consistency protocol supporting log submission out-of-order |
Jin WANG( ),Bo-han LI*( ),Jia-jun WU,Xin-yang SONG |
College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China |
|
|
Abstract A Raft protocol variant, parallel commit Raft (PC-Raft), was proposed, in order to solve the strict serialization design of log submission in Raft algorithm. In PC-Raft, pipeline was used in the log submission stage and the log buffer was redesigned to realize out-of-order submission of logs. The RDMA network was used as the transmission method to improve the concurrency of log transmission and speed up the transmission. In the log execution stage, batch processing was used to package multiple instructions and send them to the state machine for execution one by one. For the ghost recurrence problem that occurs, the combination of LSN and term number was used to ensure the security of the log. In order to solve the problem of log holes caused by disordered log submission, the leader election algorithm was redesigned, and temporary leaders were added to the election to ensure that the elected leaders can recover the logs as quickly as possible. Test results show that PC-Raft has a significant performance improvement in throughput compared with Raft, and the delay is lower. In the case of frequent log instruction dependency, the throughput is higher and the latency is lower compared with the existing Raft based variant.
|
Received: 31 July 2022
Published: 28 February 2023
|
|
Fund: CCF-华为数据库系统创新研究计划资助项目(CCF HUAWEIDBIR2020001A) |
Corresponding Authors:
Bo-han LI
E-mail: 1598076545@qq.com;bhli@nuaa.edu.cn
|
支持日志乱序提交的分布式一致性协议
为了解决Raft算法中日志提交的严格串行化设计,提出一种Raft协议变体:并行提交Raft (PC-Raft). PC-Raft在日志提交阶段运用流水线,重新设计日志缓冲区,实现日志的乱序提交. 传输方式使用RDMA网络,在提高日志传输的并发性的同时加快传输速度. 在日志执行阶段,采用批处理,将多条指令打包发送给状态机逐条执行. 针对日志并行提交情况下会出现的幽灵复现问题,采用LSN与任期号结合的方式保证日志的安全性. 针对日志乱序提交会出现的日志空洞问题,重新设计领导者选举算法,在选举中加入临时领导者,保证选举出的领导者能最快恢复日志. 测试结果证明PC-Raft对比Raft在吞吐量方面有着明显的性能提升,同时延迟更低,并且在日志指令依赖频繁的情况下,吞吐量比现有基于Raft的变体更高,延迟也更低.
关键词:
一致性协议,
Paxos,
Raft,
PC-Raft,
乱序提交
|
|
[1] |
KEMME B, ALONSO G. Don't be lazy, be consistent: Postgres-R, a new way to implement database replication [C]// VLDB. Portugal: [s.n.], 2000: 134-143.
|
|
|
[2] |
WIESMANN M, PEDONE F, SCHIPER A, et al. Database replication techniques: a three parameter classification [C]// Proceedings of 19th IEEE Symposium on Reliable Distributed Systems SRDS-2000. [s.l.]: IEEE, 2000: 206-215.
|
|
|
[3] |
STONEBRAKER M Concurrency control and consistency of multiple copies of data in distributed INGRES[J]. IEEE Transactions on Software Engineering, 1979, 5 (3): 188- 194
|
|
|
[4] |
SCHNEIDER F B Implementing fault-tolerant services using the state machine approach: a tutorial[J]. ACM Computing Surveys (CSUR), 1990, 22 (4): 299- 319
doi: 10.1145/98163.98167
|
|
|
[5] |
LAMPORT L. The part-time parliament [M]// Concurrency: the Works of Leslie Lamport, 2019: 277-317.
|
|
|
[6] |
KIRSCH J, AMIR Y. Paxos for system builders: an overview [C]// Proceedings of the 2nd Workshop on Large-Scale Distributed Systems and Middleware. Portugal: [s.n.], 2008: 1-6.
|
|
|
[7] |
CHARAPKO A, AILIJIANG A, DEMIRBAS M. Linearizable quorum reads in Paxos [C]// 11th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 19). Portugal: [s.n.], 2019.
|
|
|
[8] |
ZHENG J, LIN Q, XU J, et al PaxosStore: high-availability storage made practical in WeChat[J]. Proceedings of the VLDB Endowment, 2017, 10 (12): 1730- 1741
doi: 10.14778/3137765.3137778
|
|
|
[9] |
ONGARO D, OUSTERHOUT J. In search of an understandable consensus algorithm [C]// 2014 USENIX Annual Technical Conference (Usenix ATC 14). Portugal: [s. n. ], 2014: 305-319.
|
|
|
[10] |
GRAY J N. Notes on data base operating systems [M]// Operating systems. Berlin, Heidelberg: Springer, 1978: 393-481.
|
|
|
[11] |
SKEEN D. Nonblocking commit protocols [C]// Proceedings of the 1981 ACM SIGMOD International Conference on Management of Data. Portugal: [s.n.], 1981: 133-142.
|
|
|
[12] |
GIFFORD D K. Weighted voting for replicated data [C]// Proceedings of the 7th ACM Symposium on Operating Systems Principles. Portugal: [s.n.], 1979: 150-162.
|
|
|
[13] |
LAMPORT L Fast paxos[J]. Distributed Computing, 2006, 19 (2): 79- 103
doi: 10.1007/s00446-006-0005-x
|
|
|
[14] |
CHANDRA T D, GRIESEMER R, REDSTONE J. Paxos made live: an engineering perspective [C]// Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing. Portugal: [s.n.], 2007: 398-407.
|
|
|
[15] |
HUNT P, KONAR M, JUNQUEIRA F P, et al. ZooKeeper: wait-free coordination for internet-scale systems [C]// 2010 USENIX Annual Technical Conference (USENIX ATC 10). Portugal: [s.n.], 2010.
|
|
|
[16] |
MORARU I, ANDERSEN D G, KAMINSKY M. There is more consensus in egalitarian parliaments [C]// Proceedings of the 24th ACM Symposium on Operating Systems Principles. Portugal: [s.n.], 2013: 358-372.
|
|
|
[17] |
POKE M, HOEFLER T. Dare: high-performance state machine replication on RDMA networks [C]// Proceedings of the International Symposium on High-Performance Parallel and Distributed Computing (HPDC). Portugal: [s.n.], 2015: 107-118.
|
|
|
[18] |
MITCHELL C, GENG Y, LI J. Using one-sided RDMA reads to build a fast, CPU-Efficient key-value store [C]// Proceedings of the 2013 USENIX Annual Technical Conference. San Jose: [s.n.], 2013: 103–114.
|
|
|
[19] |
ZUO P, SUN J, YANG L, et al. One-sided {RDMA-Conscious} extendible hashing for disaggregated memory [C]// 2021 USENIX Annual Technical Conference (USENIX ATC 21). Portugal: [s.n.], 2021: 15-29.
|
|
|
[20] |
GUO C, WU H, DENG Z, et al. RDMA over commodity ethernet at scale [C]// Proceedings of the 2016 ACM SIGCOMM Conference. Portugal: [s.n.], 2016: 202-215.
|
|
|
[21] |
GAO Y, YANG Y, CHEN T, et al. DCQCN+: taming large-scale incast congestion in RDMA over ethernet networks [C]// 2018 IEEE 26th International Conference on Network Protocols (ICNP). [s.l.]: IEEE, 2018: 110-120.
|
|
|
[22] |
YI B, XIA J, CHEN L, et al. Towards zero copy dataflows using RDMA [EB/OL]. [2022-06-01]. http://dl.acm.org/doi/abs/10.1145/3123878.3131975.
|
|
|
[23] |
FENT P, VAN RENEN A, KIPF A, et al. Low-latency communication for fast DBMS using RDMA and shared memory [C]// 2020 IEEE 36th International Conference on Data Engineering (ICDE). [s.l.]: IEEE, 2020: 1477-1488.
|
|
|
[24] |
WANG Y, LIU K, TIAN C, et al. Error recovery of RDMA packets in data center networks [C]// 28th International Conference on Computer Communication and Networks (ICCCN). [s.l.]: IEEE, 2019: 1-8.
|
|
|
[25] |
ZAMANIAN E, YU X, STONEBRAKER M, et al Rethinking database high availability with RDMA networks[J]. Proceedings of the VLDB Endowment, 2019, 12 (11): 1637- 1650
doi: 10.14778/3342263.3342639
|
|
|
[26] |
WU H, CHEN K, WU Y W, ZHENG W M Research on the consensus of big data systems based on RDMA and NVM[J]. Big Data Research, 2019, 5 (4): 89- 99
|
|
|
[27] |
ANDERSON T E, CANINI M, KIM J, et al. Assise: performance and availability via client-local {NVM} in a distributed file system [C]// 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). Portugal: [s. n. ], 2020: 1011-1027.
|
|
|
[28] |
CAPPELLETTI P, ANNUNZIATA R, ARNAUD F, et al Phase change memory for automotive grade embedded NVM applications[J]. Journal of Physics D: Applied Physics, 2020, 53 (19): 193002
doi: 10.1088/1361-6463/ab71aa
|
|
|
[29] |
ZUO P, SUN J, YANG L, et al. One-sided RDMA-conscious extendible hashing for disaggregated memory [EB/OL]. [2022-06-01]. https://www.ccf.org.cn/ccfdl/ccf_dl_focus/computer-architecture/volume4/zllb1/2022-07-26/766849.shtml.
|
|
|
[30] |
CHEN A A review of emerging non-volatile memory (NVM) technologies and applications[J]. Solid-State Electronics, 2016, 125 (11): 25- 38
|
|
|
[31] |
CAO W, LIU Z, WANG P, et al PolarFS: an ultra-low latency and failure resilient distributed file system for shared storage cloud database[J]. Proceedings of the VLDB Endowment, 2018, 11 (12): 1849- 1862
doi: 10.14778/3229863.3229872
|
|
|
[32] |
ZHANG Z, HU H, YU Y, et al. Dependency preserved raft for transactions [C]// International Conference on Database Systems for Advanced Applications. Cham: Springer, 2020: 228-245.
|
|
|
|
Viewed |
|
|
|
Full text
|
|
|
|
|
Abstract
|
|
|
|
|
Cited |
|
|
|
|
|
Shared |
|
|
|
|
|
Discussed |
|
|
|
|