Intel IGP CL team,
I'm seeing a large performance regression from Haswell to Broadwell when comparing 64-bit ulongs or performing min/max operations.
Please check the number of instructions that Broadwell is generating for the two ulong (64-bit) "compare-exchange" sequences below.
The 64-bit compare-exchange sequences run half as fast on Broadwell as they do on Haswell.
The 32-bit compare-exchanges appear to be correct (they're very fast).
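To be clear about terminology: by "compare-exchange" I mean the sorting-network primitive that orders a pair of values with a min and a max, not an atomic cmpxchg. Here is a minimal sketch in plain C, with uint64_t standing in for OpenCL's ulong. The helper name cmpxchg_u64 is made up for illustration and this is not one of the kernels in this report.

```c
#include <stdint.h>

/* Hypothetical helper (name is illustrative only): order two 64-bit
 * values so that *a <= *b afterwards, using one min and one max. */
static void cmpxchg_u64(uint64_t *a, uint64_t *b)
{
    uint64_t lo = (*a < *b) ? *a : *b;  /* min(*a, *b) */
    uint64_t hi = (*a < *b) ? *b : *a;  /* max(*a, *b) */
    *a = lo;
    *b = hi;
}
```

The 32-bit version is the same thing with uint32_t, so I would expect the 64-bit one to cost a small constant factor more, not to regress from one generation to the next.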