Two questions:
(1) What is the expected behavior of out-of-order queues on GEN9 + NEO?
I'm issuing a number of small kernels into an out of order command queue with profiling enabled and no barriers between the NDRanges.
I'm not seeing kernels being run concurrently despite each kernel only using a fraction of a sub-slice (3 sub-slices available).
The benchmark is being run for one iteration.
(2) What is the expected profiling behavior of enqueued barriers?
I'm enqueueing a barrier with no wait list between each NDRange and looking at the start and end time of both the barriers and kernels.
Barriers are reporting immensely long execution times (end - start) ... often in the 6-10 milliseconds when an event is attached.
Furthermore, an enqueued barrier's start time appears to begin before kernels preceding it in the out of order command queue.
This is unintuitive and the durations seem impossibly long.
But... adding to the confusion, is that the interleaved kernel NDRanges seem to start and end back-to-back with only a few microseconds delay similar to (1).
Summary
What am I missing with out-of-order queues on GEN/NEO and are the reported durations of barriers correct?
Examples:
Each example list the order the command is issued, its type and it's start/end/duration in nanonseconds (via profiling).
Out-of-order queue with no barriers (which is not what I want):
[0 ] CL_COMPLETE CL_COMMAND_NDRANGE_KERNEL : 275065828645407, 275065828856573, 211166
[1 ] CL_COMPLETE CL_COMMAND_NDRANGE_KERNEL : 275065828858044, 275065828867044, 9000
[2 ] CL_COMPLETE CL_COMMAND_NDRANGE_KERNEL : 275065828867855, 275065828913105, 45250
Out-of-order queue with barriers between kernels but with NULL for the barrier's event
[0 ] CL_COMPLETE CL_COMMAND_NDRANGE_KERNEL : 275228853506912, 275228853721495, 214583
[1 ] CL_COMPLETE CL_COMMAND_NDRANGE_KERNEL : 275228853722460, 275228853732710, 10250
[2 ] CL_COMPLETE CL_COMMAND_NDRANGE_KERNEL : 275228853736931, 275228853776931, 40000
Out-of-order queue with barriers that record an event for profiling:
[0 ] CL_COMPLETE CL_COMMAND_NDRANGE_KERNEL : 275372161923465, 275372162135631, 212166
[1 ] CL_COMPLETE CL_COMMAND_BARRIER : 275372158781086, 275372162447953, 3666867
[2 ] CL_COMPLETE CL_COMMAND_NDRANGE_KERNEL : 275372162137683, 275372162146683, 9000
[3 ] CL_COMPLETE CL_COMMAND_BARRIER : 275372158807180, 275372164875723, 6068543
[4 ] CL_COMPLETE CL_COMMAND_NDRANGE_KERNEL : 275372162148451, 275372162192534, 44083
[5 ] CL_COMPLETE CL_COMMAND_BARRIER : 275372158836095, 275372167017520, 8181425
( Ignore any minor swings in kernel execution time )