Hi,
Again, I've been trying to characterize when it makes sense to offload computation from the CPU to the IGP (i7-5775C CPU vs. Iris Pro IGP). I noticed that for very simple kernels (e.g. a single fma or min/max operation) the CPU greatly outperforms the IGP, by up to 50%, and on investigation it seems that kernel launch overhead has a lot to do with it. Some results to explain:
FMA Kernel (using FMA_LOOP = 1):
void kernel fmaKernel(global float * out){
    float sum = out[get_global_id(0)];
    // FMA_LOOP is a compile-time define; cast it to float for use as an fma() argument
    for(int i = 0; i < FMA_LOOP; i++){
        sum = fma(1.02345f, (float)FMA_LOOP, sum);
    }
    out[get_global_id(0)] = sum;
}
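(For reference, FMA_LOOP is just a preprocessor define. A minimal sketch of how it might be injected at build time, assuming "program" and "device" are valid handles from the usual host setup, not my actual harness:)
/* Sketch only: pass FMA_LOOP as a compile-time define to the kernel compiler. */
clBuildProgram(program, 1, &device, "-D FMA_LOOP=1", NULL, NULL);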
Do Nothing Kernel:
void kernel doNothing(global float * out){
    return;
}
As a side note, I have forced my IGP to remain at its full clock speed (1.15 GHz), and likewise the CPU cores (3.7 GHz).
These results all reflect 2D square images (e.g. 32x32, 64x64, 128x128, 512x512, and so on), with one thread (work-item) per pixel.
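The launch itself is just a plain 2D NDRange over the image, roughly along these lines (names like "queue", "kernel", "width", and "height" are placeholders, not my actual benchmark code):
/* One work-item per pixel over a width x height image.
   Local work-group size is left to the runtime (NULL). */
size_t global[2] = { width, height };
cl_event evt;
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, NULL, 0, NULL, &evt);
clWaitForEvents(1, &evt);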
So I'm led to believe that there is much higher overhead in creating IGP threads than CPU threads, which is surprising given that the IGP is supposed to excel at handling lots of threads, and this overhead is preventing the IGP from really shining in these experiments. I'm guessing I could have each OpenCL thread process more than one pixel (see the sketch below), but that complicates the kernels, which don't suffer from this overhead on the CPU.
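A rough sketch of that multi-pixel idea (PIXELS_PER_THREAD would be another hypothetical compile-time define; this is not what my benchmark currently does):
/* Hypothetical variant: each work-item handles PIXELS_PER_THREAD consecutive pixels,
   so the global size (and the number of launched threads) shrinks by that factor. */
void kernel fmaKernelBlocked(global float * out){
    int base = get_global_id(0) * PIXELS_PER_THREAD;
    for(int p = 0; p < PIXELS_PER_THREAD; p++){
        float sum = out[base + p];
        for(int i = 0; i < FMA_LOOP; i++){
            sum = fma(1.02345f, (float)FMA_LOOP, sum);
        }
        out[base + p] = sum;
    }
}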
So my question to the forum/Intel is: why is the launch overhead so much greater on the IGP than on the CPU? Or is there something about my experiments that simply makes it appear that way, when it can be explained otherwise? I've attempted measuring with OpenCL timers vs. wall clock and don't really see a difference.
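For the OpenCL-timer side I read the profiling counters off the kernel event, roughly as below (requires the queue to be created with CL_QUEUE_PROFILING_ENABLE; "evt" is the event from the enqueue above). I treat queued->submit and submit->start as launch/queue overhead and start->end as execution time:
/* Break the event timeline into queue, submit, and execution phases (nanoseconds). */
cl_ulong t_queued, t_submit, t_start, t_end;
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_QUEUED, sizeof(cl_ulong), &t_queued, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_SUBMIT, sizeof(cl_ulong), &t_submit, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,  sizeof(cl_ulong), &t_start,  NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,    sizeof(cl_ulong), &t_end,    NULL);
cl_ulong queue_ns  = t_submit - t_queued;  /* time sitting in the host-side queue */
cl_ulong launch_ns = t_start  - t_submit;  /* submit until the kernel actually starts */
cl_ulong exec_ns   = t_end    - t_start;   /* kernel execution time on the device */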
Config:
Ubuntu 14.04 LTS
Intel OpenCL 1.2-5.0.0.43 (CPU-x64)
Intel OpenCL 1.2-1.0 (Graphics Driver for HD Graphics, Iris, Iris Pro)
Run benchmark:
./runBench.sh
For my system, platform=0 is the IGP and platform=1 is the CPU.