Hi,
I'm new to OpenCL and have implemented a program that computes the
dot product. The program works as expected on a GPU, but it returns
a wrong result on a CPU if a work-group contains more than one
work-item. I was able to narrow the problem down using only two
work-items per work-group and one work-group per NDRange. On the GPU,
both work-items are present before and after the reduction operation.
On the CPU, only one work-item (local_id 1) is left after the
reduction operation, so the partial sum of the work-group, which
work-item 0 should store, is never written. The program uses
libOpenCL.so.1 from opencl-1.2-sdk-6.3.0.1904,
opencl_runtime_16.1.1_x64_sles_6.4.0.25, and the OpenCL driver from
CUDA-8.0. Does somebody know why only one work-item is left after the
reduction operation? Is something wrong with my kernel (most likely),
or have I found a problem in the Intel OpenCL implementation for
CPUs (very unlikely)?
loki introduction 230 gcc dot_prod_OpenCL_orig.c errorCodes.c -lOpenCL
loki introduction 231 a.out
Try to find first GPU on available platforms.
...
******** Using platform 1 ********
Use device Quadro K2200.
before reduction: local_id = 0
before reduction: local_id = 1
after reduction: local_id = 0
after reduction: local_id = 1
sum = 6.000000e+01
loki introduction 232 gcc dot_prod_OpenCL.c errorCodes.c -lOpenCL
loki introduction 233 a.out
Try to find first CPU on available platforms.
******** Using platform 0 ********
Use device Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz.
before reduction: local_id = 0
before reduction: local_id = 1
after reduction: local_id = 1
sum = 2.265776e-316
loki introduction 234 strace a.out |& grep ocl
open("/usr/local/intel/opencl-1.2-6.4.0.25/lib64/libintelocl.so", O_RDONLY|O_CLOEXEC) = 5
open("/usr/local/intel/opencl-1.2-6.4.0.25/lib64/__ocl_svml_l9.so", O_RDONLY|O_CLOEXEC) = 3
loki introduction 235
dot_prod_OpenCL.h
-----------------
#define VECTOR_SIZE 10
#define WORK_ITEMS_PER_WORK_GROUP 2 /* power of two required */
#define WORK_GROUPS_PER_NDRANGE 1
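The "power of two required" constraint above could also be checked at
compile time on the host side. This is only a minimal sketch, assuming a
C11 host compiler; the macro name IS_POWER_OF_TWO is hypothetical and not
part of the original header.

```c
#define WORK_ITEMS_PER_WORK_GROUP 2   /* power of two required */

/* Hypothetical helper: non-zero if n is a power of two (1, 2, 4, ...).
 * A power of two has exactly one bit set, so n & (n - 1) is zero.
 */
#define IS_POWER_OF_TWO(n) ((n) > 0 && (((n) & ((n) - 1)) == 0))

/* C11 host code only: stop the build if the constraint is violated. */
_Static_assert (IS_POWER_OF_TWO (WORK_ITEMS_PER_WORK_GROUP),
                "WORK_ITEMS_PER_WORK_GROUP must be a power of two");
```

With a value such as 6 the build would stop with the message above
instead of producing a reduction that silently skips elements.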
dotProdKernel.cl
----------------
#if defined (cl_khr_fp64) || defined (cl_amd_fp64)
#include "dot_prod_OpenCL.h"
__kernel void dotProdKernel (__global const double * restrict a,
                             __global const double * restrict b,
                             __global double * restrict partial_sum)
{
  /* Use local memory to store each work-item's running sum. */
  __local double cache[WORK_ITEMS_PER_WORK_GROUP];
  double temp = 0.0;
  int cacheIdx = get_local_id (0);

  for (int tid = get_global_id (0);
       tid < VECTOR_SIZE;
       tid += get_global_size (0))
  {
    temp += a[tid] * b[tid];
  }
  cache[cacheIdx] = temp;

  /* Ensure that all work-items have completed, before you add up the
   * partial sums of each work-item to the sum of the work-group.
   */
  barrier (CLK_LOCAL_MEM_FENCE);

  /* Each work-item will add two values and store the result back to
   * "cache". We need "log_2 (WORK_ITEMS_PER_WORK_GROUP)" steps to
   * reduce all partial values to one work-group value.
   * WORK_ITEMS_PER_WORK_GROUP must be a power of two for this
   * reduction.
   */
  printf ("before reduction: local_id = %u\n", get_local_id (0));
  for (int i = get_local_size (0) / 2; i > 0; i /= 2)
  {
    if (cacheIdx < i)
    {
      cache[cacheIdx] += cache[cacheIdx + i];
      barrier (CLK_LOCAL_MEM_FENCE);
    }
  }
  printf ("after reduction: local_id = %u\n", get_local_id (0));

  /* Store the partial sum of this work-group. */
  if (cacheIdx == 0)
  {
    partial_sum[get_group_id (0)] = cache[0];
  }
}
#else
#error "Double precision floating point not supported."
#endif
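
For reference, the reduction scheme described in the kernel comments can
be simulated sequentially on the host in plain C. This is only a minimal
sketch under stated assumptions, not the code that produced the output
above: the function name simulate_work_group is hypothetical, the global
size is assumed to equal WORK_ITEMS_PER_WORK_GROUP (one work-group per
NDRange), and "lid" stands in for get_local_id (0). In the sketch the end
of each reduction step, which corresponds to a barrier in a kernel, lies
outside the "if", so every simulated work-item reaches it. The OpenCL
specification requires that a barrier inside a conditional is reached by
all work-items of the work-group if any work-item enters that conditional;
whether this explains the CPU behaviour shown above is not confirmed here.

```c
#define VECTOR_SIZE 10
#define WORK_ITEMS_PER_WORK_GROUP 2   /* power of two required */

/* Hypothetical host-side simulation of one work-group. It returns the
 * value that work-item 0 would store as the partial sum of the group.
 */
double simulate_work_group (const double *a, const double *b)
{
  double cache[WORK_ITEMS_PER_WORK_GROUP];

  /* Phase 1: each simulated work-item builds its running sum with a
   * stride equal to the (assumed) global size.
   */
  for (int lid = 0; lid < WORK_ITEMS_PER_WORK_GROUP; ++lid)
  {
    double temp = 0.0;
    for (int tid = lid; tid < VECTOR_SIZE; tid += WORK_ITEMS_PER_WORK_GROUP)
    {
      temp += a[tid] * b[tid];
    }
    cache[lid] = temp;
  }

  /* Phase 2: tree reduction in log_2 (WORK_ITEMS_PER_WORK_GROUP) steps.
   * The end of each pass over "lid" is where a barrier would sit in a
   * kernel; work-items that skip the "if" still reach that point.
   */
  for (int i = WORK_ITEMS_PER_WORK_GROUP / 2; i > 0; i /= 2)
  {
    for (int lid = 0; lid < WORK_ITEMS_PER_WORK_GROUP; ++lid)
    {
      if (lid < i)
      {
        cache[lid] += cache[lid + i];
      }
    }
  }
  return cache[0];
}
```

Calling simulate_work_group with two arrays of VECTOR_SIZE doubles gives a
host-side reference value that can be compared with partial_sum[0] read
back from the device.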
Thank you very much in advance for any help.
Kind regards
Siegmar