Hi all!
I'm new to OpenCL and want to compare the performance gain of OpenCL kernels over C code.
Can someone please explain which of these two methods is better/correct for profiling OpenCL code when comparing performance against a C reference implementation:
1. Using QueryPerformanceCounter()/__rdtsc() cycles (called inside a getTime() function):

ret |= clFinish(command_queue); // drain the queue first
getTime(&begin);
ret |= clEnqueueNDRangeKernel(command_queue, kernel, 2, NULL, global_ws, NULL, 0, NULL, NULL); // profiling disabled
ret |= clFinish(command_queue); // wait for the kernel to finish
getTime(&end);
g_NDRangePureExecTimeSec = elapsed_time(&begin, &end); // performs (end - begin) / (CLOCK_PER_CYCLE*CLOCK_PER_CYCLE*CLOCK_PER_CYCLE)
2. Using event profiling (requires a queue created with CL_QUEUE_PROFILING_ENABLE):

cl_ulong begin, end; // profiling counters are cl_ulong nanoseconds
ret = clFinish(command_queue); // drain the queue first (a marker alone does not do this)
ret |= clEnqueueNDRangeKernel(command_queue, kernel, 2, NULL, global_ws, NULL, 0, NULL, &evt1);
ret |= clWaitForEvents(1, &evt1);
ret |= clGetEventProfilingInfo(evt1, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &begin, NULL);
ret |= clGetEventProfilingInfo(evt1, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, NULL);
g_NDRangePureExecTimeSec = (cl_double)(end - begin) / 1e9; // nsec to sec
ret |= clReleaseEvent(evt1);
Furthermore, I'm not using a dedicated graphics card; I'm running the following kernel on Intel HD 4600 integrated graphics:
__kernel void filter_rows(__global float *ip_img,
                          __global float *op_img,
                          int width, int height,
                          int pitch, int N,
                          __constant float *W)
{
    __private int i = get_global_id(0);
    __private int j = get_global_id(1);
    __private float a;
    __private int image_offset = N*pitch + N;
    __private int curr_pix = j*pitch + i + image_offset;

    // apply 17-tap filter
    a  = ip_img[curr_pix-8] * W[0];
    a += ip_img[curr_pix-7] * W[1];
    a += ip_img[curr_pix-6] * W[2];
    a += ip_img[curr_pix-5] * W[3];
    a += ip_img[curr_pix-4] * W[4];
    a += ip_img[curr_pix-3] * W[5];
    a += ip_img[curr_pix-2] * W[6];
    a += ip_img[curr_pix-1] * W[7];
    a += ip_img[curr_pix  ] * W[8];
    a += ip_img[curr_pix+1] * W[9];
    a += ip_img[curr_pix+2] * W[10];
    a += ip_img[curr_pix+3] * W[11];
    a += ip_img[curr_pix+4] * W[12];
    a += ip_img[curr_pix+5] * W[13];
    a += ip_img[curr_pix+6] * W[14];
    a += ip_img[curr_pix+7] * W[15];
    a += ip_img[curr_pix+8] * W[16];

    // write output
    op_img[curr_pix] = a;
}
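For comparison, a minimal scalar C reference for the same 17-tap row filter might look like the sketch below. The name filter_rows_ref is hypothetical; my actual C reference is vectorized and not shown here, but the indexing matches the kernel above:

```c
#include <stddef.h>

/* Scalar reference: 17-tap horizontal convolution over a padded image.
 * The image has an N-pixel border on every side, so the first valid
 * output pixel sits at offset N*pitch + N, exactly as in the kernel. */
void filter_rows_ref(const float *ip_img, float *op_img,
                     int width, int height, int pitch, int N,
                     const float *W)
{
    int image_offset = N * pitch + N;
    for (int j = 0; j < height; ++j) {
        for (int i = 0; i < width; ++i) {
            int curr_pix = j * pitch + i + image_offset;
            float a = 0.0f;
            /* taps run from curr_pix-8 to curr_pix+8, weights W[0..16] */
            for (int k = -8; k <= 8; ++k)
                a += ip_img[curr_pix + k] * W[k + 8];
            op_img[curr_pix] = a;
        }
    }
}
```

Note the tap radius is hard-coded to 8 here (as in the kernel), independent of the N border parameter.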
There is similar code for column-wise processing. I'm observing a gain (OpenCL vs. optimized, vectorized C reference) of around 11x using method 1 and around 16x using method 2.
However, I've noticed people claiming gains on the order of 200-300x when using dedicated graphics cards.
So my questions are:
1. What magnitude of gain can I expect if I run the same code on a dedicated graphics card? Will it be of a similar order, or will the dedicated card outperform Intel HD Graphics?
2. Can I map the warp and thread concepts from CUDA onto Intel HD Graphics (i.e., the number of threads executing in parallel)?
Help is appreciated.