The FAQ states "Yes, Intel OpenCL* SDK 2013 introduces performance improvements that include full code generation on the Intel Advanced Vector Extensions (Intel AVX and Intel AVX2)."
I'm trying to get it to produce code that utilises the AVX2 FMA3 instructions.
I'm using the Kernel Builder with the target set to the AVX2 instruction set (CPU - 64 bit AVX2).
-------------
__kernel void dofma(const global float *a, const global float *b, const global float *c, global float *out)
{
    uint gid = get_global_id(0);
    float fa = a[gid];
    float fb = b[gid];
    float fc = c[gid];
    fa = mad(fa, fb, fc);
    out[gid] = fa;
}
------------------
This gives code that uses vmulps and vaddps, but no VFMADD213-type instructions.
Using fa = fma(fa,fb,fc); instead
produces a lot more code and a function call for the fma, which results in very low performance.