
Built-in workgroup reduction performance


Hi,

I'm working on writing a global reduction in OpenCL 2.0. I started with the implementation from CLOGS:
https://sourceforge.net/p/clogs/wiki/Home/

Essentially, the approach is just a series of workgroup-wide reductions that are combined at the end. I thought I would try updating the implementation to use the OpenCL 2.0 built-in workgroup reduction, i.e. work_group_reduce_add().
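For concreteness, the built-in variant of the per-workgroup step looks roughly like the sketch below. This is a simplified illustration rather than the attached kernel; the partialSums buffer, the argument names, and the zero-padding of out-of-range work-items are assumptions on my part, and the kernel has to be compiled with -cl-std=CL2.0 for work_group_reduce_add() to be available.

// Simplified sketch of the built-in per-workgroup reduction (not the attached kernel).
kernel void reduce_builtin(global const float *input,
                           global float *partialSums,
                           uint n)
{
    size_t gid = get_global_id(0);

    // Out-of-range work-items contribute 0 so every work-item can participate.
    float value = (gid < n) ? input[gid] : 0.0f;

    // OpenCL 2.0 built-in: all work-items in the work-group take part,
    // and each one receives the group-wide sum.
    float groupSum = work_group_reduce_add(value);

    // One work-item per group writes the partial result; a later pass
    // combines the per-group partial sums into the final total.
    if (get_local_id(0) == 0)
        partialSums[get_group_id(0)] = groupSum;
}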

I was surprised that the global reduction performs slower when the workgroup reductions are computed using the built-in reduction. Specifically, I ran a test of 1000 reductions on randomly sized arrays (sizes in the range 1 - 100000). The random number generator is given the same seed, so the inputs are identical across runs.

Using the built-in reduction, the combined total kernel time is ~39 ms. Using the CLOGS approach, the total kernel time is ~32 ms. I was also surprised to see that the kernel using the built-in reduction used 1796 bytes of local memory, while the CLOGS approach used only 1028 bytes (as reported by the OpenCL Code Builder).
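For comparison, the manual workgroup reduction we are measuring against is essentially the classic tree reduction in local memory, in the spirit of the CLOGS code. Again, this is a simplified sketch rather than the attached kernel, and WG_SIZE is an assumed compile-time work-group size used here only for illustration.

// Simplified sketch of the manual local-memory tree reduction (CLOGS-style).
#define WG_SIZE 256   // assumed work-group size for illustration

kernel void reduce_manual(global const float *input,
                          global float *partialSums,
                          uint n)
{
    local float scratch[WG_SIZE];

    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);

    // Stage one element per work-item in local memory (0 if out of range).
    scratch[lid] = (gid < n) ? input[gid] : 0.0f;
    barrier(CLK_LOCAL_MEM_FENCE);

    // Halve the active range each step until scratch[0] holds the group sum.
    for (uint stride = WG_SIZE / 2; stride > 0; stride >>= 1) {
        if (lid < stride)
            scratch[lid] += scratch[lid + stride];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (lid == 0)
        partialSums[get_group_id(0)] = scratch[0];
}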

We're interested in learning why the built-in reduction appears to be slower and to use more resources than the simple CLOGS approach. Perhaps there are trade-offs we're not aware of? We get similar results on an AMD GPU.

I've attached the kernel file that implements the global reduction (with both the built-in and the CLOGS approach to the workgroup-wide reductions).

The GPU I am using is:
Intel HD Graphics 5500
Windows 10
Driver: 20.19.15.4531

Thanks in advance!

Attachment: reduce.txt (3.53 KB)

Thread Topic: Question
