Hi,
I'm writing a global reduction in OpenCL 2.0. I started with the implementation from CLOGS:
https://sourceforge.net/p/clogs/wiki/Home/
Essentially, the approach is a series of workgroup-wide reductions whose partial results are combined at the end. I thought I would try updating the implementation to use the OpenCL 2.0 built-in workgroup reduction, i.e. work_group_reduce_add().
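To make the structure concrete, here is a minimal host-side sketch in plain C of the two-stage scheme described above. It is a sequential emulation written for illustration only, not the CLOGS code or the attached kernel: WG_SIZE, workgroup_reduce() and global_reduce() are hypothetical names, and the scratch array merely stands in for local memory.

```c
#include <stddef.h>

#define WG_SIZE 256 /* hypothetical work-group size; must be a power of two */

/* Emulates one work-group summing its slice of `in` with a tree reduction.
 * `scratch` stands in for the kernel's local memory. In an OpenCL 2.0
 * kernel, the tree loop below is the part that a single call to
 * work_group_reduce_add() would replace. */
static int workgroup_reduce(const int *in, size_t n, size_t group_start)
{
    int scratch[WG_SIZE];

    /* Each "work-item" loads one element; out-of-range items contribute 0. */
    for (size_t lid = 0; lid < WG_SIZE; ++lid) {
        size_t gid = group_start + lid;
        scratch[lid] = (gid < n) ? in[gid] : 0;
    }

    /* Tree reduction: the active range halves every step. In a real kernel
     * each step would be separated by barrier(CLK_LOCAL_MEM_FENCE). */
    for (size_t stride = WG_SIZE / 2; stride > 0; stride /= 2) {
        for (size_t lid = 0; lid < stride; ++lid)
            scratch[lid] += scratch[lid + stride];
    }
    return scratch[0];
}

/* Second stage: combine the per-work-group partial sums. */
static int global_reduce(const int *in, size_t n)
{
    int total = 0;
    for (size_t start = 0; start < n; start += WG_SIZE)
        total += workgroup_reduce(in, n, start);
    return total;
}
```

Calling global_reduce() on an array holding 1..1000 returns 500500. The sketch only shows where the two variants differ (the inner tree loop versus one built-in call); it says nothing about how a given driver implements work_group_reduce_add() internally.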
I was surprised to find that the global reduction is slower when the workgroup reductions are computed using the built-in reduction. Specifically, I ran a test of 1000 reductions on randomly sized arrays (sizes in the range 1 - 100000). The random number generator is given the same seed, so the random numbers are the same for different runs.
Using the built-in reduction, the total kernel time is ~39 ms. Using the CLOGS approach, the total kernel time is ~32 ms. I was also surprised to see that the kernel using the built-in reduction used 1796 bytes of local memory, while the CLOGS approach used only 1028 bytes (as reported by OpenCL Code Builder).
We're interested in learning why the built-in reduction appears to be slower and use more resources than the simple CLOGS approach. Perhaps there are trade-offs we're not aware of? We get similar results on an AMD GPU.
I've attached the kernel file that implements the global reduction (with both the built-in and CLOGS approaches to workgroup-wide reductions).
The GPU I am using is:
Intel HD5500
Windows 10
driver: 20.19.15.4531
Thanks in advance!