Gradient sparsification compression approach to reducing communication in distributed training
Shi-da CHEN 1,2, Qiang LIU 1,2,*, Liang HAN 3
1. School of Microelectronics, Tianjin University, Tianjin 300072, China; 2. Tianjin Key Laboratory of Imaging and Sensing Microelectronic Technology, Tianjin 300072, China; 3. Alibaba Group, Sunnyvale 94085, USA
Existing gradient sparsification compression techniques still incur considerable time overhead in practical applications. To address this problem, a low-complexity, high-speed approach based on the residual gradient compression (RGC) algorithm was proposed for distributed training to select the communication set of the top-k sparse gradients. Firstly, the Wasserstein distance was used to verify that the gradient distribution conforms to a Laplacian distribution. Secondly, the key points were determined from the area relationship of the Laplacian distribution curve, and the feature parameters were simplified by maximum likelihood estimation. Finally, the top-k threshold of the sparse gradient was estimated and then corrected by a binary search algorithm. The proposed approach avoids the instability of random sampling methods and complex operations such as data sorting. Deep neural networks for image classification were trained on a GPU platform with the CIFAR-10 and CIFAR-100 datasets to evaluate the effectiveness of the proposed approach. Results show that, at the same training accuracy, the approach accelerated the training process by up to 1.62 and 1.30 times compared with the radixSelect and hierarchical selection methods, respectively.
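To make the procedure concrete, the following Python sketch (not the authors' implementation) illustrates the idea of Laplacian-distribution-based threshold estimation with binary-search correction for top-k selection: the scale parameter is obtained by maximum likelihood estimation as the mean absolute gradient, the tail probability of Laplace(0, b) yields an initial threshold for a target density, and a binary search then corrects the threshold until the selected count is close to k. The function name `estimate_topk_threshold`, the `density` and `tol` parameters, and the use of NumPy are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def estimate_topk_threshold(grad, density=0.001, max_iters=20, tol=0.1):
    """Estimate a threshold t so that about density * grad.size entries satisfy |g| >= t.

    Hypothetical sketch: fits |g| with a Laplacian model and corrects the
    analytic estimate with a binary search instead of sorting the tensor.
    """
    g = np.abs(grad.ravel())
    k = max(1, int(density * g.size))

    # Maximum likelihood estimate of the Laplacian scale: b = mean(|g|).
    b = g.mean()

    # For Laplace(0, b), P(|g| > t) = exp(-t / b); invert for the target density.
    t = -b * np.log(density)

    # Binary-search correction of the analytic threshold.
    lo, hi = 0.0, float(g.max())
    for _ in range(max_iters):
        count = int((g >= t).sum())
        if abs(count - k) <= tol * k:
            break
        if count > k:        # too many survivors -> raise the threshold
            lo = t
        else:                # too few survivors  -> lower the threshold
            hi = t
        t = 0.5 * (lo + hi)
    return t

# Example: select a sparse gradient set without a full sort.
grad = np.random.laplace(0.0, 1e-3, size=1_000_000).astype(np.float32)
t = estimate_topk_threshold(grad, density=0.001)
indices = np.flatnonzero(np.abs(grad) >= t)   # communication set: indices ...
values = grad[indices]                        # ... and the corresponding values
```

The binary search only counts elements above the candidate threshold, so each correction step is a linear scan rather than a sort, which is the source of the claimed speedup over sort-based top-k selection.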
Shi-da CHEN, Qiang LIU, Liang HAN. Gradient sparsification compression approach to reducing communication in distributed training. Journal of Zhejiang University (Engineering Science), 2021, 55(2): 386-394.
Tab. 1 Wasserstein distance of two distributions in different networks
Fig. 1 Histogram of residual gradient statistics in ResNet50
Fig. 2 Diagram of area feature in Laplacian distribution curve
Fig. 3 Diagram of modified LDTE based on binary search
Fig. 4 Parallel processing of computing and communication for DNN distributed training
Fig. 5 Effects of five approaches for sparse gradient set selection under different data sizes
Fig. 6 Comparison of model learning curve and top-1 accuracy under different strategies
Tab. 2 Comparison of training accuracy of models under different strategies (training accuracy/%; change relative to baseline in parentheses)

| Dataset | Network model | V/MB | Baseline | radixSelect | DGC hierarchical top-k | RGC pruning top-k | RGC LDTE-BS |
| CIFAR-10 | ResNet101 | 162.17 | 93.70 | 93.21 (−0.49) | 93.28 (−0.42) | 93.23 (−0.47) | 93.42 (−0.28) |
| CIFAR-10 | DenseNet169 | 47.66 | 94.04 | 93.32 (−0.72) | 93.29 (−0.75) | 93.32 (−0.72) | 93.56 (−0.48) |
| CIFAR-100 | ResNet50 | 89.72 | 74.78 | 72.69 (−2.09) | 72.49 (−2.29) | 72.71 (−2.07) | 73.11 (−1.67) |
| CIFAR-100 | DenseNet121 | 26.54 | 75.41 | 73.25 (−2.16) | 73.14 (−2.27) | 73.17 (−2.24) | 73.85 (−1.56) |
Tab. 3 Speedup of computation time for models with same accuracy under different strategies (speedup α, with radixSelect = 1.00 as the baseline)

| Network model | radixSelect | Hierarchical selection | Pruning | LDTE-BS |
| ResNet101 | 1.00 | 1.54 | 1.53 | 1.62 |
| DenseNet169 | 1.00 | 0.98 | 1.08 | 1.18 |
| ResNet50 | 1.00 | 1.48 | 1.53 | 1.55 |
| DenseNet121 | 1.00 | 0.84 | 1.01 | 1.12 |
Fig. 7 Comparison of scalability between various sparsification and classic training approaches on K40C GPU
Fig. 8 Comparison of scalability between various sparsification and classic training approaches on V100 GPU