I asked a similar question last year and would like to know whether there is any way to coax the compiler into mapping "vectorized" code onto the IGP.
More specifically, I'd like to launch a workgroup where each work item is a SIMD4 or SIMD4x2 vector and the number of vector registers per work item might approach 128.
I have a few interesting near-embarrassingly parallel kernels that require a few rounds of inter-lane communication across SIMD lanes. The kernels map well to AVX2 and to GPU architectures with swizzle/permute/shuffle support. The kernels also work without a swizzle, but then they have to bounce everything through local memory. Avoiding local memory provides a decent speedup.
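To make the two variants concrete, here is a rough host-side C sketch of the exchange pattern. It is not the actual kernel code. The `vec4` type, the `LANES` width of 4, and the function names `swizzle` and `exchange_via_local` are all made up for illustration, and the `scratch` array only stands in for local memory.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4  /* assumed SIMD4 width, for illustration only */

/* One "vector register": LANES floats owned by a single work item. */
typedef struct { float lane[LANES]; } vec4;

/* In-register variant: output lane i reads input lane sel[i] directly. */
static vec4 swizzle(vec4 v, const unsigned sel[LANES]) {
    vec4 out;
    for (size_t i = 0; i < LANES; ++i) {
        assert(sel[i] < LANES);  /* selector must name a valid lane */
        out.lane[i] = v.lane[sel[i]];
    }
    return out;
}

/* Fallback variant: the same exchange, staged through a scratch buffer
 * that stands in for local memory (hypothetical; a real kernel would
 * need a barrier between the store and load phases). */
static vec4 exchange_via_local(vec4 v, const unsigned sel[LANES],
                               float *scratch) {
    vec4 out;
    for (size_t i = 0; i < LANES; ++i) {
        scratch[i] = v.lane[i];            /* store phase */
    }
    for (size_t i = 0; i < LANES; ++i) {
        assert(sel[i] < LANES);
        out.lane[i] = scratch[sel[i]];     /* load phase */
    }
    return out;
}
```

Both functions produce the same result. The difference I care about is that the second one pays for a round trip through memory on every communication round, which is what I'd like to avoid.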
With the current scalar-per-thread code generation (SIMD8/16/32?) and no access to inter-subgroup swizzles, all communication needs to be bounced through shared local memory, whereas executing in SIMD4/4x2 mode would presumably allow me to use the SIMD swizzle support.
I assume the answer is still "no", but vectorizer "knobs" might be a useful feature to add to future versions of the IGP compiler.