Hi,
I am analyzing the performance of a simple memcpy kernel on the GPU using VTune. The kernel is below:
__kernel void deinterlace_Y(__read_only image2d_t YIn, __write_only image2d_t YOut)
{
    /* Doing operation of Memcpy */
    int2 coord_src = (int2)(get_global_id(0), get_global_id(1));
    const sampler_t smp = CLK_FILTER_NEAREST | CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE;
    uint4 pixel4 = read_imageui(YIn, smp, coord_src);
    write_imageui(YOut, coord_src, pixel4);
}
I observe the following stats for the execution units:
EU Array
Active Stalled Idle
24.6% 18.1% 57.2%
Also, the number of computing threads started is 24,525,023, which is quite high. I don't know how to reduce the number of threads started here to improve performance.
I can't work out how to improve its performance. I have gone through the optimizations described at https://int2-software.intel.com/en-us/articles/optimizing-simple-opencl-kernels but all of them apply to buffers, where 16 elements can be read from memory in one go. In my case I am reading through texture memory (the image APIs), so I don't know how to increase the performance.
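For contrast, this is roughly what I understand the buffer-based pattern to look like. It is only my own sketch, not code from the article: the kernel name copy16 and the flat uchar buffers are made up for illustration, and it assumes the buffer length is a multiple of 16 and that one work-item is launched per 16 bytes.

```c
/* Hypothetical buffer version of the copy, for comparison only.
   Assumes: src and dst hold a multiple of 16 bytes, and the global
   work size is (byte count / 16), so each work-item moves 16 bytes. */
__kernel void copy16(__global const uchar *src, __global uchar *dst)
{
    size_t i = get_global_id(0);

    /* vload16 reads 16 consecutive uchar values starting at src + i * 16 */
    uchar16 v = vload16(i, src);

    /* vstore16 writes them to the same position in dst */
    vstore16(v, i, dst);
}
```

With images, each work-item in my kernel handles a single pixel through read_imageui and write_imageui, and I don't see an equivalent way to fetch 16 elements in one call.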