I have written an OpenCL benchmarking application, https://github.com/krrishnarraj/clpeak . One of the tests measures the compute capacity (GFLOPS) of the device. When run on 32-bit Windows, it gives the expected results on Sandy Bridge:
Platform: Intel(R) OpenCL
Device: Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
Driver version: 1.2 (Win32)

Single-precision compute (GFLOPS)
float : 25.19
float2 : 50.48
float4 : 50.37
float8 : 51.75
float16 : 51.85
The theoretical peak of this device is 76.8 GFLOPS.
But when the same code runs on 64-bit Windows, it gives very different results:
Platform: Intel(R) OpenCL
Device: Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
Driver version: 1.2 (Win64)

Single-precision compute (GFLOPS)
float : 25.15
float2 : 99.25
float4 : 172.25
float8 : 80.07
float16 : 96.42
It looks like the vector code (float2, float4) has been optimized down to scalar float, or some out-of-order optimization has happened. I'm not sure what is going on.

The ASM output from the kernel analyzer shows all the fmad and fmul instructions properly generated. Is there any optimization specific to x64? Anything advanced?