Improving performance portability for GPU-specific OpenCL kernels on multi-core/many-core CPUs by analysis-based transformations

doi:10.1631/FITEE.1500032

Front. Inform. Technol. Electron. Eng.

2015, Vol. 16, Issue (11): 899-916


Mei Wen, Da-fei Huang, Chang-qing Xun, Dong Chen

School of Computer, National University of Defense Technology, Changsha 410073, China; National Key Laboratory of Parallel and Distributed Processing, Changsha 410073, China


Abstract OpenCL is an open heterogeneous programming framework. Although OpenCL programs are functionally portable, they do not provide performance portability, so code transformation often plays an irreplaceable role. When GPU-specific OpenCL kernels are adapted to run on multi-core/many-core CPUs, coarsening the thread granularity is necessary and has therefore been used extensively. However, the locality optimizations written into GPU-specific OpenCL code are usually inherited without analysis, which may have side effects on CPU performance. Typically, using OpenCL's local memory on multi-core/many-core CPUs may degrade rather than improve performance, because local-memory arrays no longer match the hardware well and the associated synchronizations are costly. To resolve this dilemma, we analyze the memory access patterns using array-access descriptors derived from GPU-specific kernels. The kernels can then be adapted for CPUs by (1) removing all unwanted local-memory arrays together with the obsolete barrier statements and (2) optimizing the coalesced kernel code with vectorization and locality re-exploitation. Moreover, we have developed an automated tool chain that transforms GPU-specific OpenCL kernels into a CPU-friendly form; together with an accompanying scheduler, it forms a new OpenCL runtime. Experiments show that the automated transformation can improve OpenCL kernel performance on a multi-core CPU by an average factor of 3.24. Satisfactory performance improvements are also achieved on Intel's many-integrated-core coprocessor. The resulting performance on both architectures is better than or comparable with the corresponding OpenMP performance.
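The effect of removing a local-memory array and its barrier can be illustrated with a minimal sketch in plain C. It is not the authors' tool chain or its output: the 3-point stencil, the work-group size WG_SIZE, and both function names are assumptions chosen only for illustration. The first function mimics what a straightforward coalescing of a GPU-style kernel looks like on a CPU, where the barrier splits the work-item loop in two and a staging tile stands in for local memory. The second shows the form that becomes possible once analysis establishes that the tile is only a copy of global data.

```c
#include <stddef.h>

#define WG_SIZE 4 /* hypothetical work-group size, chosen for illustration */

/* Coalesced kernel that still inherits the GPU-style local-memory tile.
   The barrier between the load and compute phases forces two separate
   loops over the work-items of each group. */
void stencil_with_local(const float *in, float *out, size_t n)
{
    for (size_t group = 0; group * WG_SIZE < n; ++group) {
        float tile[WG_SIZE + 2]; /* stands in for the local array plus halo */
        size_t base = group * WG_SIZE;

        /* Phase before the barrier: stage in[base-1 .. base+WG_SIZE]. */
        for (size_t slot = 0; slot < WG_SIZE + 2; ++slot) {
            size_t g = base + slot; /* source index is g - 1 */
            tile[slot] = (g >= 1 && g - 1 < n) ? in[g - 1] : 0.0f;
        }

        /* barrier(CLK_LOCAL_MEM_FENCE) would sit here in the OpenCL kernel. */

        /* Phase after the barrier: compute from the staged tile. */
        for (size_t lid = 0; lid < WG_SIZE; ++lid) {
            size_t gid = base + lid;
            if (gid < n)
                out[gid] = tile[lid] + tile[lid + 1] + tile[lid + 2];
        }
    }
}

/* Same computation after the tile and the barrier are removed: a single
   loop that reads global memory directly and relies on the CPU cache. */
void stencil_coalesced(const float *in, float *out, size_t n)
{
    for (size_t gid = 0; gid < n; ++gid) {
        float left  = (gid > 0)     ? in[gid - 1] : 0.0f;
        float right = (gid + 1 < n) ? in[gid + 1] : 0.0f;
        out[gid] = left + in[gid] + right;
    }
}
```

Both functions produce the same output for the same input. The second has a single flat loop with unit-stride accesses, which is the kind of code a vectorizing compiler tends to handle well.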

Key words: OpenCL; Performance portability; Multi-core/many-core CPU; Analysis-based transformation

Received: 30 January 2015 Published: 04 November 2015

CLC: TP312


Cite this article:

Mei Wen, Da-fei Huang, Chang-qing Xun, Dong Chen. Improving performance portability for GPU-specific OpenCL kernels on multi-core/many-core CPUs by analysis-based transformations. Front. Inform. Technol. Electron. Eng., 2015, 16(11): 899-916.

URL:

http://www.zjujournals.com/xueshu/fitee/10.1631/FITEE.1500032 OR http://www.zjujournals.com/xueshu/fitee/Y2015/V16/I11/899

Using an analysis-based code transformation method to improve the performance portability of GPU-specific OpenCL kernels on multi-core/many-core CPUs

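Objective: To address the poor performance portability of GPU-oriented OpenCL kernels on CPUs, we design a code transformation method based on memory access pattern analysis that improves performance portability.
Innovation: By analyzing the memory access patterns in OpenCL kernels, unnecessary local-memory arrays and the synchronization statements they entail are removed, and the code is further optimized with vectorization and locality re-exploitation, yielding a significant performance improvement.
Methods: First, a precise linearized access descriptor is designed for the array accesses in OpenCL kernel code (Fig. 2). Then, using this descriptor, GPU-specific OpenCL kernel code is transformed in two steps to improve its performance on CPUs (Fig. 7). The first step is analysis-based work-item coalescing: the access descriptors are analyzed to find and remove unnecessary local-memory arrays and the synchronization statements they entail, after which the work-items are coalesced. The second step is architecture-aware code optimization: the coalesced code is further optimized with vectorization and locality re-exploitation to suit the characteristics of the CPU architecture. Finally, this transformation process is integrated into a tool chain which, together with a scheduler, is embedded in an open-source OpenCL runtime system (Fig. 11). Experimental results show that the transformation method can significantly improve the performance of GPU-specific OpenCL kernels on an Intel Sandy Bridge CPU and an Intel Knights Corner coprocessor.
Conclusions: Accurate analysis of the memory access patterns in OpenCL kernel code not only helps determine whether local-memory arrays suit the CPU architecture, but also guides the subsequent code optimization. It is therefore an important step toward improving performance portability.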
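To make the role of an access descriptor more concrete, the following C sketch uses a deliberately simplified affine form, index = c_group * group + c_lid * lid + offset. This is an assumption for illustration only; the paper's actual descriptor (Fig. 2) is not reproduced here, and the type AffineAccess and both function names are hypothetical. The sketch shows one thing such a descriptor allows: if a local array is filled by a copy tile[s] = in[load(group, s)], a later read tile[read(lid)] can be rewritten as a direct global read by substituting one affine expression into the other.

```c
/* Hypothetical affine descriptor: index = c_group*group + c_lid*lid + offset. */
typedef struct {
    long c_group; /* coefficient of the work-group id */
    long c_lid;   /* coefficient of the local work-item id */
    long offset;  /* constant term */
} AffineAccess;

/* Evaluate the descriptor for a concrete (group, lid) pair. */
long eval_access(AffineAccess a, long group, long lid)
{
    return a.c_group * group + a.c_lid * lid + a.offset;
}

/* Given the staging copy tile[s] = in[load(group, s)] and a later read
   tile[read(lid)], derive the descriptor of the equivalent global read.
   Returns 1 on success, or 0 when the read index depends on the group id,
   a case this sketch does not handle. */
int forward_local_read(AffineAccess load, AffineAccess read,
                       AffineAccess *global_read)
{
    if (read.c_group != 0)
        return 0;
    /* Substitute s = read.c_lid*lid + read.offset into load(group, s). */
    global_read->c_group = load.c_group;
    global_read->c_lid   = load.c_lid * read.c_lid;
    global_read->offset  = load.c_lid * read.offset + load.offset;
    return 1;
}
```

For example, with a work-group size of 4, a staging copy described by {4, 1, -1} (that is, in[4*group + s - 1]) and a read tile[lid + 2] described by {0, 1, 2} compose to {4, 1, 1}, which is in[4*group + lid + 1]. Once every read of the tile can be forwarded in this way, the tile and its barrier carry no information that global memory does not already hold.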

Key words: OpenCL; Performance portability; Multi-core/many-core CPU; Analysis-based transformation

