JOURNAL OF ZHEJIANG UNIVERSITY (ENGINEERING SCIENCE)  2018, Vol. 52 Issue (3): 515-523    DOI: 10.3785/j.issn.1008-973X.2018.03.013
Computer and Communication Technology     
Parallel computing method for two-dimensional matrix convolution
ZHANG Jun-yang, GUO Yang, HU Xiao
College of Computer, National University of Defense Technology, Changsha 410073, China

Abstract  

A parallel implementation method based on the FT2000 multi-core vector processor was proposed to improve the computational efficiency of two-dimensional matrix convolution in convolutional neural network models. Each convolution kernel element was broadcast to a vector register with the broadcast instruction, and the row elements of the convolution matrix were loaded with vector load instructions. Through shuffle operations, the matrix convolution, which is difficult to parallelize directly, was transformed into vectorizable multiply-add operations, and the execution efficiency was improved by reducing memory accesses and fully reusing the loaded data. Two common matrix convolution configurations were designed: a varying convolution matrix size with a fixed convolution kernel size, and a fixed convolution matrix size with a varying convolution kernel size. The influence of the different configurations on the execution efficiency of the algorithm was analyzed and compared. Finally, comparison experiments were conducted against a server-level multi-core CPU and the TI6678. Results show that the FT2000 has a clear computational advantage over both the multi-core CPU and the TI6678, with speedups of up to 11 974 times over the multi-core CPU and 21 times over the TI6678.
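As an illustration only (not the paper's implementation), the scheme described above can be sketched in plain C: each kernel element plays the role of the broadcast scalar, a contiguous slice of an input row plays the role of the vector load, and the inner loop over output columns is the part a vector unit would execute as multiply-adds. All names below (conv2d_rowwise, in, k, out, and the test sizes) are assumptions made for this sketch, and the correlation form common in CNNs (no kernel flip) is used.

    #include <stddef.h>
    #include <stdio.h>

    /* Scalar sketch of the row-wise scheme: for each kernel element
     * k[u][v], the element is (conceptually) broadcast, a slice of input
     * row (y+u) is loaded, and the products are accumulated into an entire
     * output row with multiply-adds.  On a vector processor the inner
     * x-loop runs on vector registers, so each loaded row segment is fully
     * reused and the irregular convolution access pattern becomes regular
     * multiply-add operations. */
    static void conv2d_rowwise(const float *in, size_t in_h, size_t in_w,
                               const float *k, size_t k_h, size_t k_w,
                               float *out)
    {
        size_t out_h = in_h - k_h + 1;   /* "valid" output height */
        size_t out_w = in_w - k_w + 1;   /* "valid" output width  */

        for (size_t i = 0; i < out_h * out_w; ++i)
            out[i] = 0.0f;

        for (size_t y = 0; y < out_h; ++y) {
            for (size_t u = 0; u < k_h; ++u) {
                const float *in_row = in + (y + u) * in_w;
                for (size_t v = 0; v < k_w; ++v) {
                    float kv = k[u * k_w + v];          /* element to broadcast */
                    for (size_t x = 0; x < out_w; ++x)  /* vectorizable loop    */
                        out[y * out_w + x] += kv * in_row[x + v];
                }
            }
        }
    }

    int main(void)
    {
        /* 4x4 input holding 0..15, 3x3 kernel of ones -> 2x2 window sums. */
        float in[16], k[9], out[4];
        for (int i = 0; i < 16; ++i) in[i] = (float)i;
        for (int i = 0; i < 9;  ++i) k[i]  = 1.0f;
        conv2d_rowwise(in, 4, 4, k, 3, 3, out);
        printf("%.0f %.0f %.0f %.0f\n", out[0], out[1], out[2], out[3]);
        return 0;
    }

Compiled as ordinary C, the test routine prints 45 54 81 90, the four 3x3 window sums of the 4x4 input. Keeping the kernel-element loops outside the column loop is what makes each loaded row slice fully reusable and turns the hard-to-parallelize convolution into unit-stride multiply-adds, which appears to be the core of the data-reuse argument in the abstract.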



Received: 04 March 2017      Published: 11 September 2018
CLC:  TP391  
Cite this article:

ZHANG Jun-yang, GUO Yang, HU Xiao. Parallel computing method for two-dimensional matrix convolution. JOURNAL OF ZHEJIANG UNIVERSITY (ENGINEERING SCIENCE), 2018, 52(3): 515-523.

URL:

http://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2018.03.013     OR     http://www.zjujournals.com/eng/Y2018/V52/I3/515


Parallel computing method for two-dimensional matrix convolution

To improve the computational efficiency of two-dimensional matrix convolution in convolutional neural network models, a parallel implementation method for two-dimensional matrix convolution was studied on the FT2000 multi-core vector processor. The convolution kernel elements were broadcast to vector registers with the broadcast instruction, the row elements of the convolution matrix were loaded with the vector LOAD instruction, and shuffle operations turned the matrix convolution, which is difficult to parallelize, into vectorizable multiply-add operations, so that the execution efficiency of the algorithm was improved by reducing memory accesses and fully reusing the loaded data. Two common matrix convolution configurations were designed: a varying convolution matrix size with a fixed convolution kernel size, and a fixed convolution matrix size with a varying convolution kernel size; the influence of the different configurations on the execution efficiency of the algorithm was compared and analyzed. Comparison experiments were conducted against a server-level multi-core CPU and the TI6678. The results show that the FT2000 has a clear computational advantage over the multi-core CPU and the TI6678, with speedups of up to 11 974 times over the multi-core CPU and 21 times over the TI6678.
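As a point of reference for the two configurations (fixed kernel with a growing matrix, and fixed matrix with a growing kernel), the work in a valid two-dimensional convolution follows from the standard size relation below; the symbols M, N and k are generic and not taken from the paper.

    O_h \times O_w = (M - k + 1) \times (N - k + 1), \qquad
    \text{MACs} = k^{2}\,(M - k + 1)(N - k + 1)

Growing the matrix with a fixed kernel therefore scales the work with the output area, while growing the kernel on a fixed matrix scales it roughly with k^2, which is one reason the two configurations can affect execution efficiency differently.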

