I'm storing 8 x 64-bit quad-words (SIMD8) to shared local memory (SLM) and am trying to understand the curious GEN instruction sequence that results.
The OpenCL line of code in question is a store to a doubly indexed array in SLM:
shared.m0[2][local_id] = r1;
Why does this indexed store to SLM result in 4-6 "mov" operations and two sends?
I assume some of the MOV operations are needed to prepare the SEND "message".
But why are there two SEND ops?
send (8|M0) null:ud r27:ud 0xA 0x40F0020 // hdc.dc0 wr:2h, rd:0, wr.scrdwfc: 0x70020
send (8|M0) null:ud r59:ud 0xC 0x6026CFE // hdc.dc1 wr:3, rd:0, wr.usurf msc:44, to SLM
I understand the second SEND, but what is the first one doing, and why is it necessary? Is it a queue barrier of some sort?
Also, why are there so many MOV operations for this 8x64-bit SIMD8 store?