Consider the following kernel:
__kernel void test(__global float2 *output, __global float2 *input) { __local float lmem[8]; float2 a; const size_t tid = get_global_id(0); if(tid / 8 == 0) { a = input[tid]; } else { return; } lmem[tid] = -a.x; barrier(CLK_LOCAL_MEM_FENCE); a.x = lmem[tid]; barrier(CLK_LOCAL_MEM_FENCE); output[tid] = a; }
If I execute it with global size == local_size == 16 and pass an array of 16 float2 elements as input:
input = [ 0.+0.j 1.+0.j 2.+0.j 3.+0.j 4.+0.j 5.+0.j 6.+0.j 7.+0.j 8.+0.j 9.+0.j 10.+0.j 11.+0.j 12.+0.j 13.+0.j 14.+0.j 15.+0.j]
and a zero-filled buffer as output, I expect the first 8 elements of the output to have their real parts negated in the output array, while the rest of it remaining untouched:
output = [-0.+0.j -1.+0.j -2.+0.j -3.+0.j -4.+0.j -5.+0.j -6.+0.j -7.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j]
This is what happens on Ubuntu 12.04 x64, nVidia CUDA 5 platform, Tesla C2050 device. But on the same operating system, Intel OpenCL XE SDK 2013 3.0.67279, and Intel Xeon E5620 the whole resulting buffer remains untouched:
output = [ 0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j]
The output coincides with the reference CUDA output if I do any of the following:
- Comment out the barriers;
- Use float arrays instead of float2
- Initialize "a" inside the kernel instead of reading it from input (i.e. as "a = (float2)(tid, 0)").
Has anyone encountered such behavior? Is it a bug, or am I making incorrect assumptions about how barriers work?