Suppose each CPU socket has 43 GB/s of memory bandwidth through its four memory channels, and I have a dual-socket system, so 86 GB/s in aggregate. I would expect a reduction over a large array to approach 86 GB/s, but it doesn't: it still achieves at most 43 GB/s, the bandwidth of a single socket. Why is that, and is there anything in Intel's OpenCL implementation for CPUs that can fix it?
How could I fix that outside of OpenCL?
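
To make the question concrete, here is a minimal sketch of the kind of reduction and bandwidth figure I have in mind. It is written in plain C with an OpenMP pragma rather than OpenCL, purely for illustration; the names `sum_reduce` and `bandwidth_gbs` are made up for this example, and the 43 and 86 GB/s figures are the assumed numbers from above, not measurements.

```c
#include <stddef.h>

/* Sum n floats into a double accumulator.
 * When compiled with -fopenmp the loop is split across threads;
 * without it the pragma is ignored and the loop runs serially. */
double sum_reduce(const float *x, size_t n)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum) schedule(static)
    for (long i = 0; i < (long)n; i++)
        sum += x[i];
    return sum;
}

/* Effective bandwidth in GB/s (1 GB = 1e9 bytes) for a pass that
 * reads n_bytes from memory in the given number of seconds. */
double bandwidth_gbs(size_t n_bytes, double seconds)
{
    return (double)n_bytes / seconds / 1e9;
}
```

Timing `sum_reduce` over an array much larger than the caches and passing the array size in bytes and the elapsed time to `bandwidth_gbs` gives the number I am comparing against 43 GB/s and 86 GB/s.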