
dot product kernel doesn't work on CPUs


Hi,

I'm new to OpenCL and I have implemented a program that computes the
dot product of two vectors. The program works as expected on a GPU,
but it returns a wrong result on a CPU as soon as a work-group
contains more than one work-item. I was able to narrow down the
problem by using only two work-items per work-group and one
work-group per NDRange: on the GPU I see two work-items both before
and after the reduction operation, while on the CPU only one
work-item is left after the reduction operation, so that the partial
sum of the work-group is never stored. The program uses libOpenCL.so.1
from opencl-1.2-sdk-6.3.0.1904, opencl_runtime_16.1.1_x64_sles_6.4.0.25,
and the OpenCL driver from CUDA-8.0. Does somebody know why only one
work-item is left after the reduction operation? Is something wrong
with my kernel (most likely) or have I found a problem in the Intel
OpenCL implementation for CPUs (very unlikely)?

loki introduction 230 gcc dot_prod_OpenCL_orig.c errorCodes.c -lOpenCL
loki introduction 231 a.out

Try to find first GPU on available platforms.
...
  ********  Using platform 1  ********
    Use device Quadro K2200.

before reduction: local_id = 0
before reduction: local_id = 1
after reduction:  local_id = 0
after reduction:  local_id = 1
sum = 6.000000e+01

loki introduction 232 gcc dot_prod_OpenCL.c errorCodes.c -lOpenCL
loki introduction 233 a.out

Try to find first CPU on available platforms.
  ********  Using platform 0  ********
    Use device Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz.

before reduction: local_id = 0
before reduction: local_id = 1
after reduction:  local_id = 1
sum = 2.265776e-316

loki introduction 234 strace a.out |& grep ocl
open("/usr/local/intel/opencl-1.2-6.4.0.25/lib64/libintelocl.so", O_RDONLY|O_CLOEXEC) = 5
open("/usr/local/intel/opencl-1.2-6.4.0.25/lib64/__ocl_svml_l9.so", O_RDONLY|O_CLOEXEC) = 3
loki introduction 235

dot_prod_OpenCL.h
-----------------

#define VECTOR_SIZE               10
#define WORK_ITEMS_PER_WORK_GROUP 2    /* power of two required */
#define WORK_GROUPS_PER_NDRANGE   1
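
The host program is not listed here; it enqueues the kernel with the
global and local work sizes derived from these two macros, roughly as
in the following sketch (the names cmd_queue and kernel are
placeholders, not the actual host code):

  /* One work-group of WORK_ITEMS_PER_WORK_GROUP work-items. */
  size_t global_size = WORK_GROUPS_PER_NDRANGE * WORK_ITEMS_PER_WORK_GROUP;
  size_t local_size  = WORK_ITEMS_PER_WORK_GROUP;
  cl_int err = clEnqueueNDRangeKernel (cmd_queue, kernel, 1, NULL,
                                       &global_size, &local_size,
                                       0, NULL, NULL);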

dotProdKernel.cl
----------------

#if defined (cl_khr_fp64) || defined (cl_amd_fp64)
  #include "dot_prod_OpenCL.h"

  __kernel void dotProdKernel (__global const double * restrict a,
                               __global const double * restrict b,
                               __global double * restrict partial_sum)
  {
    /* Use local memory to store each work-item's running sum.        */
    __local double cache[WORK_ITEMS_PER_WORK_GROUP];

    double temp = 0.0;
    int    cacheIdx = get_local_id (0);

    for (int tid = get_global_id (0);
         tid < VECTOR_SIZE;
         tid += get_global_size (0))
    {
      temp += a[tid] * b[tid];
    }
    cache[cacheIdx] = temp;

    /* Ensure that all work-items have stored their partial sum before the
     * partial sums are added up to form the sum of the work-group.
     */
    barrier (CLK_LOCAL_MEM_FENCE);

    /* Each work-item will add two values and store the result back to
     * "cache". We need "log_2 (WORK_ITEMS_PER_WORK_GROUP)" steps to
     * reduce all partial values to one work-group value.
     * WORK_ITEMS_PER_WORK_GROUP must be a power of two for this
     * reduction.
     */
    printf ("before reduction: local_id = %u\n", get_local_id (0));
    for (int i = get_local_size (0) / 2; i > 0; i /= 2)
    {
      if (cacheIdx < i)
      {
        cache[cacheIdx] += cache[cacheIdx + i];
        barrier (CLK_LOCAL_MEM_FENCE);
      }
    }
    printf ("after reduction:  local_id = %u\n", get_local_id (0));
    /* store the partial sum of this work-group                */
    if (cacheIdx == 0)
    {
      partial_sum[get_group_id (0)] = cache[0];
    }
  }
#else
  #error "Double precision floating point not supported."
#endif
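
For completeness, after the kernel has finished, the host reads back
one partial sum per work-group and adds them up; a sketch of that step
(the buffer and variable names are placeholders, not the actual host
code):

  double partial[WORK_GROUPS_PER_NDRANGE];
  double sum = 0.0;
  /* Blocking read of the per-work-group partial sums. */
  clEnqueueReadBuffer (cmd_queue, partial_sum_buf, CL_TRUE, 0,
                       sizeof (partial), partial, 0, NULL, NULL);
  for (int i = 0; i < WORK_GROUPS_PER_NDRANGE; ++i)
  {
    sum += partial[i];
  }
  printf ("sum = %e\n", sum);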

Thank you very much for any help in advance.

Kind regards

Siegmar

 

