I was curious to see how a local work group was distributed across a subslice.
My assumption was that each SIMDxx hardware thread in a work group would be assigned to whichever EU is least occupied at that moment.
That seems to be the case.
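
To make the assumption concrete, here is a minimal sketch of a "least occupied EU" policy. It is only a toy model, not how the hardware dispatcher is actually implemented. The function name `assign_threads` is made up for illustration, and the defaults of 8 EUs per subslice and 7 hardware threads per EU are my assumptions about the part being tested. It also assumes the subslice is empty when the work group arrives.

```python
def assign_threads(work_group_size, simd_width=8, num_eus=8, threads_per_eu=7):
    """Toy model: place each hardware thread on the least occupied EU.

    num_eus and threads_per_eu are assumed values, not read from the device.
    Returns the number of hardware threads placed on each EU.
    """
    # One hardware thread covers simd_width work items (round up).
    num_threads = -(-work_group_size // simd_width)
    if num_threads > num_eus * threads_per_eu:
        raise ValueError("work group does not fit in one subslice")

    occupancy = [0] * num_eus
    for _ in range(num_threads):
        # Pick the EU with the fewest resident threads; ties go to the lowest index.
        eu = min(range(num_eus), key=lambda i: occupancy[i])
        occupancy[eu] += 1
    return occupancy


if __name__ == "__main__":
    for size in (224, 128, 64, 32, 16, 8):
        print(size, assign_threads(size))
```

Under these assumptions a 224 work-item group is 28 SIMD8 hardware threads, so the model spreads them as evenly as it can across the 8 EUs, while an 8 work-item group is a single thread on a single EU.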
Here is a plot showing a SIMD8 kernel being launched with local work group sizes of 224, 128, 64, 32, 16, and 8: