Improving the performance portability of GPU-specific OpenCL kernels on multi-core/many-core CPUs using analysis-based code transformations

doi:10.1631/FITEE.1500032

Front. Inform. Technol. Electron. Eng., 2015, Vol. 16, Issue 11: 899-916

Mei Wen, Da-fei Huang, Chang-qing Xun, Dong Chen

School of Computer, National University of Defense Technology, Changsha 410073, China; National Key Laboratory of Parallel and Distributed Processing, Changsha 410073, China

Improving performance portability for GPU-specific OpenCL kernels on multi-core/many-core CPUs by analysis-based transformations

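Summary: Objective: OpenCL kernels designed for GPUs show poor performance portability on CPUs. To address this problem, we design a code transformation method based on the analysis of memory access characteristics.
Innovation: By analyzing the memory access patterns in an OpenCL kernel, the method removes unnecessary local-memory arrays together with the synchronization statements they introduce, then further optimizes the code with vectorization and locality re-exploitation, ultimately achieving significant performance gains.
Methods: First, an exact linearized access descriptor is designed for the array accesses in OpenCL kernel code (Fig. 2). Then, using this descriptor, GPU-specific OpenCL kernel code is transformed in two steps to improve its performance on CPUs (Fig. 7). The first step is analysis-based work-item coalescing: the access descriptors are analyzed to find and remove unnecessary local-memory arrays and the synchronization statements they introduce, after which the work-items are coalesced. The second step is architecture-aware code optimization: the coalesced code is further optimized with vectorization and locality re-exploitation to suit the characteristics of the CPU architecture. Finally, the transformation process is integrated into a tool chain which, together with a scheduler, is embedded in an open-source OpenCL runtime system (Fig. 11). Experimental results show that this transformation method significantly improves the performance of GPU-specific OpenCL kernels on Intel Sandy Bridge CPUs and Intel Knights Corner coprocessors.
Conclusions: Accurately analyzing the memory access patterns in OpenCL kernel code not only helps determine whether a local-memory array suits the CPU architecture, but also guides the subsequent code optimization. It is therefore an important step toward better performance portability.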

Keywords: OpenCL; Performance portability; Multi-core/many-core CPU; Analysis-based transformation

Abstract: OpenCL is an open heterogeneous programming framework. Although OpenCL programs are functionally portable, they do not provide performance portability, so code transformation often plays an irreplaceable role. When adapting GPU-specific OpenCL kernels to run on multi-core/many-core CPUs, coarsening the thread granularity is necessary and has therefore been used extensively. However, the locality optimizations expressed in GPU-specific OpenCL code are usually inherited without analysis, which may harm CPU performance. Typically, the use of OpenCL's local memory on multi-core/many-core CPUs may degrade rather than improve performance, because local-memory arrays no longer match the hardware well and the associated synchronizations are costly. To solve this dilemma, we actively analyze the memory access patterns using array-access descriptors derived from GPU-specific kernels. The kernels can then be adapted for CPUs by (1) removing all the unwanted local-memory arrays together with the obsolete barrier statements and (2) optimizing the coalesced kernel code with vectorization and locality re-exploitation. Moreover, we have developed an automated tool chain that transforms GPU-specific OpenCL kernels into a CPU-friendly form, accompanied by a scheduler; together they form a new OpenCL runtime. Experiments show that the automated transformation can improve OpenCL kernel performance on a multi-core CPU by an average factor of 3.24. Satisfactory performance improvements are also achieved on Intel's many-integrated-core coprocessor. The resultant performance on both architectures is better than or comparable with the corresponding OpenMP performance.

Key words: OpenCL; Performance portability; Multi-core/many-core CPU; Analysis-based transformation
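The effect of removing an unwanted local-memory array during work-item coalescing can be illustrated with a minimal C sketch. It is not output of the authors' tool chain: the 3-point averaging kernel, the function names, and `GROUP_SIZE` are hypothetical choices made for illustration, and the input is assumed to carry one halo element on each side. The first function mimics a straightforward coalescing that keeps the GPU-style tile and splits the kernel body at the former barrier; the second shows the form obtained once the tile is judged unnecessary and the two loops are merged.

```c
#include <stddef.h>

#define GROUP_SIZE 4  /* hypothetical work-group size */

/* Coalescing without analysis: the work-items of one group become loop
 * iterations, the former __local tile stays as an ordinary array, and the
 * former barrier forces the body to be split into two loops.
 * `in` must hold num_groups * GROUP_SIZE + 2 elements (one halo per side). */
void blur3_coalesced_naive(const float *in, float *out, size_t num_groups)
{
    for (size_t g = 0; g < num_groups; ++g) {
        float tile[GROUP_SIZE + 2];      /* was a __local array on the GPU */
        size_t base = g * GROUP_SIZE;

        /* Region before the barrier: each work-item stages its element,
         * and the first and last work-items also stage the halo. */
        for (size_t lid = 0; lid < GROUP_SIZE; ++lid) {
            tile[lid + 1] = in[base + lid + 1];
            if (lid == 0)
                tile[0] = in[base];
            if (lid == GROUP_SIZE - 1)
                tile[GROUP_SIZE + 1] = in[base + GROUP_SIZE + 1];
        }

        /* barrier(CLK_LOCAL_MEM_FENCE) sat here in the GPU kernel. */

        /* Region after the barrier: compute from the staged tile. */
        for (size_t lid = 0; lid < GROUP_SIZE; ++lid)
            out[base + lid] =
                (tile[lid] + tile[lid + 1] + tile[lid + 2]) / 3.0f;
    }
}

/* After the tile is removed: every tile read is replaced by the global
 * read it mirrored, the barrier is obsolete, and one loop remains. */
void blur3_transformed(const float *in, float *out, size_t num_groups)
{
    for (size_t g = 0; g < num_groups; ++g) {
        size_t base = g * GROUP_SIZE;
        for (size_t lid = 0; lid < GROUP_SIZE; ++lid) {
            size_t i = base + lid;
            out[i] = (in[i] + in[i + 1] + in[i + 2]) / 3.0f;
        }
    }
}
```

Both functions compute the same result. The second has a single inner loop with unit-stride accesses and no intermediate copy, which is the kind of loop a CPU compiler can typically vectorize; whether a given tile can be dropped like this is the question the access-descriptor analysis is meant to answer.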

Received: 2015-01-30 Published: 2015-11-04

CLC: TP312

Cite this article:

Mei Wen, Da-fei Huang, Chang-qing Xun, Dong Chen. Improving performance portability for GPU-specific OpenCL kernels on multi-core/many-core CPUs by analysis-based transformations. Front. Inform. Technol. Electron. Eng., 2015, 16(11): 899-916.

Link to this article:

http://www.zjujournals.com/xueshu/fitee/CN/10.1631/FITEE.1500032 or http://www.zjujournals.com/xueshu/fitee/CN/Y2015/V16/I11/899
