大点数FFT在“申威26010”上的并行优化

doi:10.3785/j.issn.1008-973X.2024.01.009

浙江大学学报(工学版)

2024, Vol. 58

Issue (1): 78-86 DOI: 10.3785/j.issn.1008-973X.2024.01.009

计算机技术

大点数FFT在“申威26010”上的并行优化

郭俊1,2(

),刘鹏3,杨昕遥4,张鲁飞5,吴东5

1. 湖州职业技术学院信息工程与物联网学院，浙江湖州 313000
2. 湖州职业技术学院湖州市物联网智能系统集成技术重点实验室，浙江湖州 313000
3. 浙江大学信息与电子工程学院，浙江杭州 310027
4. 蚂蚁科技集团股份有限公司，浙江杭州 310013
5. 数学工程与先进计算国家重点实验室，江苏无锡 214125

Parallel optimization of large-point FFT on Sunway 26010

Jun GUO1,2(

),Peng LIU3,Xinyao YANG4,Lufei ZHANG5,Dong WU5

1. School of Information Engineering and Internet of Things, Huzhou Vocational and Technical College, Huzhou 313000, China
2. Huzhou Key Laboratory of IoT Intelligent System Integration Technology, Huzhou Vocational and Technical College, Huzhou 313000, China
3. College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou 310027, China
4. Ant Group Limited Company, Hangzhou 310013, China
5. State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi 214125, China

全文: PDF(1231 KB) HTML

摘要：

根据“神威·太湖之光”超级计算机所用国产“申威26010”处理器的架构特点和编程规范，提出针对大点数FFT的众核并行优化方案. 该方案源自经典的Cooley-Tukey FFT算法，通过将一维大点数数据迭代分解为二维小规模矩阵进行并行加速. 为了解决矩阵“列FFT”的读写、转置和计算问题，提出“列均分-行连续”的读写策略，通过对数据进行合理的分配、重排、交换，结合SIMD向量化、旋转因子优化、双缓冲、寄存器通信、跨步传输等优化手段，充分利用了众核处理器的计算资源和传输带宽. 实验结果显示，单核组64从核并行程序较主核运行FFTW库，可以达到最高65x、平均48x以上的加速比.

关键词： 神威·太湖之光; 申威26010; 快速傅里叶变换; Cooley-Tukey算法; 众核并行

Abstract:

A many-core parallel optimization scheme for large-point FFT was proposed according to the structural characteristics and programming specifications of the domestic Sunway 26010 processor, which was used in the Sunway Taihu Light supercomputer. The scheme was derived from the classic Cooley-Tukey FFT algorithm, and was accelerated in parallel by iteratively decomposing the one-dimensional large-point data into two-dimensional small-scale matrices. The "column-sharing, row-continuity" strategy was specially proposed in order to solve the problem of reading, writing, transposing and calculating of the "column FFT" of the matrix. The computing resources and transmission bandwidth of the many-core processor were fully utilized by reasonable data allocation, rearrangement and exchange combined with other optimization methods such as SIMD vectorization, twiddle factor optimization, double-buffering, register communication and stride transmission. The experimental results prove that the single core-group of 64 slave cores running parallel program can achieve a maximum speed-up of 65x and an average speed-up of more than 48x compared with the main core running the FFTW library.

Key words: Sunway Taihu Light Sunway 26010 fast Fourier transform Cooley-Tukey algorithm many-core parallelism

收稿日期: 2022-12-30 出版日期: 2023-11-07

CLC:

TP 338

基金资助: 数学工程与先进计算国家重点实验室开放基金资助项目（2019A10）

作者简介: 郭俊（1988—），男，讲师，博士，从事多核并行优化及超大规模集成电路设计的研究. orcid.org/0000-0003-4689-927X.E-mail： 2021049@huvtc.edu.cn

	服务
	把本文推荐给朋友
	加入引用管理器
	E-mail Alert
	作者相关文章
	郭俊
	刘鹏
	杨昕遥
	张鲁飞
	吴东

引用本文:

郭俊,刘鹏,杨昕遥,张鲁飞,吴东. 大点数FFT在“申威26010”上的并行优化[J]. 浙江大学学报(工学版), 2024, 58(1): 78-86.

Jun GUO,Peng LIU,Xinyao YANG,Lufei ZHANG,Dong WU. Parallel optimization of large-point FFT on Sunway 26010. Journal of ZheJiang University (Engineering Science), 2024, 58(1): 78-86.

链接本文:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2024.01.009 或 https://www.zjujournals.com/eng/CN/Y2024/V58/I1/78

图 1 “申威26010”处理器的示意图

图 2 一维转二维FFT的计算流程

图 3 “列均分-行连续”策略

图 4 从核数据分布

图 5 SIMD FFT数据排列

图 6 双缓冲机制

图 7 “大矩阵”“列FFT”的计算流程

图 8 采用旋转因子优化、SIMD、跨步传输、双缓冲方案的加速比

表 1 旋转因子优化前、后的计算次数

图 9 DMA传输带宽及利用率

图 10 传输和计算各部分功能的耗时占比

表 2 并行FFT的加速测试结果

21	SUN Jiadong, SUN Qiao, DENG Pan, et al Research on the optimization of blas level 1 and 2 functions on shenwei many-core processor[J]. Computer Systems and Application, 2017, 26 (11): 101- 108 doi: 10.15888/j.cnki.csa.006045
22	徐燚. 基于新一代申威众核处理器的BLAS并行优化的研究[D]. 上海: 华东师范大学, 2022: 15. XU Yi. Research on parallel optimization of BLAS based on the new generation of sunway many-core processor [D]. Shanghai: East China Normal University, 2022: 15.
23	赵玉文, 敖玉龙, 杨超, 等申威26010众核处理器上一维FFT实现与优化[J]. 软件学报, 2020, 31 (10): 3184- 3196 doi: 10.13328/j.cnki.jos.005848
1	TOP500. Top500 lists june 2016 [EB/OL]. [2016-06-30]. https://www.top500.org/lists/top500/2016/06/.
2	COOLEY J W, TUKEY J W An algorithm for the machine calculation of complex Fourier series[J]. Mathematics of Computation, 1965, 19 (90): 297- 301 doi: 10.1090/S0025-5718-1965-0178586-1
3	PRIYANKA C, LAKSHMI S, SAMHITHA A, et al High speed FFT computation using optimized radix 8 algorithm[J]. International Journal of Advanced Science and Technology, 2020, 29 (4): 6307- 6312
4	ELSHAFIY A, EL-MOTAZ M A, FARAG M E, et al. On optimization of mixed-radix FFT: a signal processing approach [C]// Proceedings of the IEEE Wireless Communications and Networking Conference. Marrakesh: IEEE, 2019: 1-6.
5	STASINSKI R. Split multiple radix FFT [C]// 30th European Signal Processing Conference. Belgrade: IEEE, 2022: 2251-2255.
6	KULKARNI V, OZA S Design and implementation of 3d FFT[J]. International Journal of Scientific and Engineering Research, 2017, 8 (4): 186- 189
7	GARRIDO M, SÁNCHEZ M A, LÓPEZ-VALLEJO M L, et al A 4096-point radix-4 memory-based FFT using DSP slices[J]. IEEE Transactions on Very Large Scale Integration Systems, 2017, 25 (1): 375- 379 doi: 10.1109/TVLSI.2016.2567784
8	XIA K F, WU B, XIONG T, et al A memory-based FFT processor design with generalized efficient conflict-free address schemes[J]. IEEE Transactions on Very Large Scale Integration Systems, 2017, 25 (6): 1919- 1929 doi: 10.1109/TVLSI.2017.2666820
9	INGEMARSSON C, KÄLLSTRÖM P, QURESHI F, et al Efficient FPGA mapping of pipeline SDF FFT cores[J]. IEEE Transactions on Very Large Scale Integration Systems, 2017, 25 (9): 2486- 2497 doi: 10.1109/TVLSI.2017.2710479
10	SHIH X Y, CHOU H R, LIU Y Q VLSI design and implementation of reconfigurable 46-mode combined-radix-based FFT hardware architecture for 3GPP LTE applications[J]. IEEE Transactions on Circuits and Systems I: Regular Papers, 2018, 65 (1): 118- 129 doi: 10.1109/TCSI.2017.2725338
11	FRIGO M, JOHNSON S G The design and implementation of fftw3[J]. Proceedings of the IEEE, 2005, 93 (2): 216- 231 doi: 10.1109/JPROC.2004.840301
12	INTEL. Intel oneapi math kernel library [EB/OL]. [2020-09-20]. https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html.
13	NVIDIA. Fast Fourier transforms for nvidia GPUs [EB/OL]. [2021-03-30]. https://developer.nvidia.com/cufft.
14	AYALA A, TOMOV S, HAIDAR A, et al. Heffte: highly efficient FFT for exascale [C]// Proceedings of the International Conference on Computational Science. Amsterdam: Springer, 2020: 262-275.
15	张明. 龙芯平台上高性能计算的性能优化关键问题研究[D]. 合肥: 中国科学技术大学, 2017: 47. ZHANG Ming. Research on key issues of performance optimization in high performance computing based on the Godson [D]. Hefei: University of Science and Technology of China, 2017: 47.
16	郭金鑫. 实数FFT算法在ARM V8处理器上的实现与性能优化研究[D]. 太原: 太原理工大学, 2021: 33. GUO Jinxin. Real FFT implementation and performance optimization on ARM V8 processor [D]. Taiyuan: Taiyuan University of Technology, 2021: 33.
17	操庐宁. 基于X86-64计算平台的FFT实现与并行优化[D]. 长春: 吉林大学, 2019: 27. CAO Luning. FFT implementation and parallel optimization based on X86-64 computing platform [D]. Changchun: Jilin University, 2019: 27.
18	YANG C, XUE W, FU H H, et al. 10m-core scalable fully-implicit solver for nonhydrostatic atmospheric dynamics [C]// Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. Salt Lake City: IEEE, 2016: 57-68.
19	FU H H, YIN W W, YANG G W, et al. 18.9-pflops nonlinear earthquake simulation on sunway taihulight: enabling depiction of 18-Hz and 8-meter scenarios [C]// Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. Denver: ACM, 2017: 13-24.
20	LIU Y, LIU X, LI F, et al. Closing the "quantum supremacy" gap: achieving real-time simulation of a random quantum circuit using a new sunway supercomputer [C]// Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. St. Louis Missouri: ACM, 2021: 25–36.
21	孙家栋, 孙乔, 邓攀, 等基于申威众核处理器的1、2级BLAS函数优化研究[J]. 计算机系统应用, 2017, 26 (11): 101- 108 doi: 10.15888/j.cnki.csa.006045

[1]	覃承进,蒋俊正. 基于深度神经网络的雷达距离超分辨方法[J]. 浙江大学学报(工学版), 2023, 57(6): 1215-1223.
[2]	骆阳,张为. 基于改进旋转因子的高性能FFT硬件设计[J]. 浙江大学学报(工学版), 2021, 55(6): 1199-1207.
[3]	孙曙光,田朋,杜太行,王景芹. 基于apFFT-AMD的密集频率谐波/间谐波检测[J]. 浙江大学学报(工学版), 2020, 54(1): 178-188.
[4]	张心怡, 杨家强, 张晓军. 基于机会约束的含多风电场动态经济调度[J]. 浙江大学学报(工学版), 2017, 51(5): 976-983.
[5]	楼文娟,王嘉伟,杨伦,陈勇. 雷暴风三维脉动风速场数值模拟[J]. 浙江大学学报(工学版), 2014, 48(7): 1162-1169.
[6]	张进高,卢琴芬,王利,叶云岳, 洪伟明. 基于FFT的直线电机地铁牵引电机功率因数角测量[J]. J4, 2012, 46(11): 1981-1984.
[7]	梅德庆刚宪约陈子辰. 一种快速高精度Duhamel积分算法[J]. J4, 2005, 39(8): 1152-1155.

Viewed

Full text

Abstract

Cited

Shared

Discussed