Journal of Zhejiang University (Engineering Science)  2018, Vol. 52 Issue (3): 515-523    DOI: 10.3785/j.issn.1008-973X.2018.03.013
Computer and Communication Technology
Parallel computing method for two-dimensional matrix convolution
ZHANG Jun-yang, GUO Yang, HU Xiao
College of Computer, National University of Defense Technology, Changsha 410073, Hunan, China
Abstract:

A parallel implementation method based on the multi-core vector processor FT2000 was studied to improve the computational efficiency of two-dimensional matrix convolution in convolutional neural network models. Convolution kernel elements were broadcast to vector registers with a broadcast instruction, and the row elements of the convolution matrix were loaded with vector LOAD instructions. Through shuffle operations, the matrix convolution, which is hard to parallelize, was turned into vectorizable multiply-add operations, so that execution efficiency was improved by reducing memory accesses and fully reusing loaded data. Two common matrix convolution configurations were designed: a varying convolution matrix size with a fixed kernel size, and a fixed convolution matrix size with a varying kernel size. The influence of the two configurations on execution efficiency was compared and analyzed. Comparison experiments were carried out against a server-class multi-core CPU and the TI6678. Results show that FT2000 has a computing advantage over both the multi-core CPU and the TI6678, with a speedup of up to 11 974 times over the multi-core CPU and 21 times over the TI6678.
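The broadcast, row load, shuffle and multiply-add steps named in the abstract can be illustrated with a minimal sketch in plain Python. This is not the FT2000 implementation: Python lists stand in for vector registers, a list slice stands in for the shuffle, and the function names `conv2d_valid_rowwise` and `conv2d_valid_reference` are hypothetical. The sketch also assumes the CNN-style convention (no kernel flip, no padding), which the abstract does not specify.

```python
def conv2d_valid_rowwise(matrix, kernel):
    """Valid-mode 2D convolution (CNN style, kernel not flipped),
    organised as row-wise multiply-adds on simulated vector registers."""
    rows, cols = len(matrix), len(matrix[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows, out_cols = rows - k_rows + 1, cols - k_cols + 1
    output = []
    for i in range(out_rows):
        acc = [0] * out_cols  # accumulator "vector register" for one output row
        for u in range(k_rows):
            row = matrix[i + u]  # one "vector load" of an input row, reused below
            for v in range(k_cols):
                coeff = [kernel[u][v]] * out_cols  # "broadcast" one kernel element
                shifted = row[v:v + out_cols]      # "shuffle": shift the row by v lanes
                # element-wise multiply-add across all lanes
                acc = [a + c * x for a, c, x in zip(acc, coeff, shifted)]
        output.append(acc)
    return output


def conv2d_valid_reference(matrix, kernel):
    """Direct sliding-window version, used only to check the sketch above."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(matrix) - k_rows + 1
    out_cols = len(matrix[0]) - k_cols + 1
    return [
        [
            sum(matrix[i + u][j + v] * kernel[u][v]
                for u in range(k_rows) for v in range(k_cols))
            for j in range(out_cols)
        ]
        for i in range(out_rows)
    ]


if __name__ == "__main__":
    m = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    k = [[1, 0], [0, -1]]
    print(conv2d_valid_rowwise(m, k))
```

In this arrangement each input row is loaded once per output row and then reused for every kernel column, which is the kind of data reuse the abstract credits for the reduced memory traffic. On real vector hardware the list comprehension would correspond to a single vector multiply-add instruction.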

Received: 2017-03-04    Published: 2018-09-11
CLC:  TP391  
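Funding:

Supported by the National Natural Science Foundation of China (60133007, 61572025) and the National Key Research and Development Program of China (2016YFB0200401).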

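Corresponding author: GUO Yang, male, professor. orcid.org/0000-0003-1600-4666. E-mail: guoyang@nudt.edu.cn
About the author: ZHANG Jun-yang (1987-), male, doctoral candidate, working on computer architecture, machine learning and embedded systems. orcid.org/0000-0002-2993-4494. E-mail: zhangjunyang11@nudt.edu.cn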
Cite this article:

ZHANG Jun-yang, GUO Yang, HU Xiao. Parallel computing method for two-dimensional matrix convolution[J]. Journal of Zhejiang University (Engineering Science), 2018, 52(3): 515-523.

Link to this article:

http://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2018.03.013        http://www.zjujournals.com/eng/CN/Y2018/V52/I3/515

