Channel: Intel® Software - OpenCL*

Weird performance with loop unrolling


I have a kernel that calculates motion vectors using full search and MSE. There are weird performance issues with the following loop:

  #define W_SIZE 16
  for (int y = 0; y != W_SIZE; y++)
  {
    for(uint x = 0; x != W_SIZE; x++)
    {
      float img1 = img1V[x + (y)*W_SIZE];
      float img2 = img2V[x + (localID&VALUE) + (y+ localID/W_SIZE)*W_SIZE*2];
      float img3 = img3V[x + (y)*W_SIZE];
      float result = img1-img2;
      float result2 = img3-img2;
      diffs += result*result;
      diffs2 += result2 * result2;
    }
  }

The whole kernel takes about 360 ms to execute. However, if I change the outer loop to iterate from -1 to W_SIZE-1 and add 1 to y inside the loop (or use any other values, for that matter), execution time drops to 170 ms.

The only reason I've come up with for this issue is loop unrolling, which only happens when the loop iterates from 0 to a scalar, but using #pragma unroll has practically no impact on performance. I also tried making one loop instead of two nested ones, but it still took 360 ms to finish.
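
For reference, here is a minimal host-side sketch in plain C (not the actual kernel) of the three loop shapes described above: the original, the one with the outer loop shifted to start at -1, and the single flattened loop. The function names, the `Sums` struct, and the value of `VALUE` are assumptions for illustration only, since the post does not show them. The sketch only shows that the three shapes accumulate the same sums; it says nothing about timing on any device.

```c
#define W_SIZE 16
/* VALUE is not shown in the post; W_SIZE - 1 is an assumption. */
#define VALUE (W_SIZE - 1)

/* Hypothetical container for the two accumulators. */
typedef struct { float diffs, diffs2; } Sums;

/* Loop body from the post for one (x, row) position. */
static void accumulate(const float *img1V, const float *img2V,
                       const float *img3V, int x, int row,
                       int localID, Sums *s)
{
    float img1 = img1V[x + row * W_SIZE];
    float img2 = img2V[x + (localID & VALUE)
                       + (row + localID / W_SIZE) * W_SIZE * 2];
    float img3 = img3V[x + row * W_SIZE];
    float result = img1 - img2;
    float result2 = img3 - img2;
    s->diffs += result * result;
    s->diffs2 += result2 * result2;
}

/* Original shape: outer loop runs from 0 to W_SIZE. */
static Sums sums_original(const float *img1V, const float *img2V,
                          const float *img3V, int localID)
{
    Sums s = {0.0f, 0.0f};
    for (int y = 0; y != W_SIZE; y++)
        for (int x = 0; x != W_SIZE; x++)
            accumulate(img1V, img2V, img3V, x, y, localID, &s);
    return s;
}

/* Shifted shape: outer loop runs from -1 to W_SIZE - 1, and 1 is added to y. */
static Sums sums_shifted(const float *img1V, const float *img2V,
                         const float *img3V, int localID)
{
    Sums s = {0.0f, 0.0f};
    for (int y = -1; y != W_SIZE - 1; y++)
        for (int x = 0; x != W_SIZE; x++)
            accumulate(img1V, img2V, img3V, x, y + 1, localID, &s);
    return s;
}

/* Flattened shape: one loop over all W_SIZE * W_SIZE positions. */
static Sums sums_flat(const float *img1V, const float *img2V,
                      const float *img3V, int localID)
{
    Sums s = {0.0f, 0.0f};
    for (int i = 0; i != W_SIZE * W_SIZE; i++)
        accumulate(img1V, img2V, img3V, i % W_SIZE, i / W_SIZE, localID, &s);
    return s;
}
```

All three functions visit the same positions in the same order, so any difference in kernel run time would have to come from how the compiler transforms each loop rather than from the arithmetic itself.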

Does anybody have any idea what is causing this and if there is any way to fix it?

