I have a kernel that calculates motion vectors using full search and MSE. There are weird performance issues with the following loop:
    #define W_SIZE 16

    for (int y = 0; y != W_SIZE; y++) {
        for (uint x = 0; x != W_SIZE; x++) {
            float img1 = img1V[x + (y)*W_SIZE];
            float img2 = img2V[x + (localID&VALUE) + (y + localID/W_SIZE)*W_SIZE*2];
            float img3 = img3V[x + (y)*W_SIZE];
            float result = img1 - img2;
            float result2 = img3 - img2;
            diffs += result * result;
            diffs2 += result2 * result2;
        }
    }
The whole kernel takes about 360 ms to execute. However, if I change the outer loop to iterate from -1 to W_SIZE-1 and add 1 to y inside the loop (or use any other offset, for that matter), execution time drops to 170 ms.
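To make the change concrete, here is a plain C sketch of the two outer-loop forms, reduced to a single squared-difference sum. The function names `sum_sq_diff_original` and `sum_sq_diff_shifted` are made up for illustration, and the `localID`/`VALUE` indexing of the real kernel is left out, so this only shows that both forms visit the same elements; it does not reproduce the timing difference.

```c
#define W_SIZE 16

/* Original form: y runs from 0 up to W_SIZE-1. */
static float sum_sq_diff_original(const float *a, const float *b)
{
    float diffs = 0.0f;
    for (int y = 0; y != W_SIZE; y++) {
        for (int x = 0; x != W_SIZE; x++) {
            float d = a[x + y * W_SIZE] - b[x + y * W_SIZE];
            diffs += d * d;
        }
    }
    return diffs;
}

/* Shifted form: y runs from -1 up to W_SIZE-2, and 1 is added back
 * wherever y is used, so the same rows are read in the same order. */
static float sum_sq_diff_shifted(const float *a, const float *b)
{
    float diffs = 0.0f;
    for (int y = -1; y != W_SIZE - 1; y++) {
        for (int x = 0; x != W_SIZE; x++) {
            float d = a[x + (y + 1) * W_SIZE] - b[x + (y + 1) * W_SIZE];
            diffs += d * d;
        }
    }
    return diffs;
}
```

Both functions accumulate in the same order, so they should return identical results; only the loop bounds the compiler sees are different.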
The only explanation I've come up with is loop unrolling that happens only when the loop iterates from 0 to a scalar, but using #pragma unroll has practically no impact on performance. I also tried using one loop instead of two nested ones, but it still took 360 ms to finish.
Does anybody have any idea what is causing this, and whether there is any way to fix it?