I'm evaluating an Intel platform for an embedded real-time processor in our systems. Our application uses OpenCL to process incoming data on a very short cycle in real time. The system must be able to keep up with the input data stream. Latency between input and output is also critical, so we cannot batch up data and process it in larger quantities. For these reasons, start latency for tasks on the OpenCL command queue is as critical as kernel processing speed.
One processing cycle looks something like this. Steps 7 to 11 each use events to trigger the next step.
1. Map input buffer
2. Queue unmap input buffer (to be triggered by a user event)
3. Queue kernels
4. Queue map output buffer
5. Copy data in
6. Trigger unmap
7. Unmap
8. Kernel 1
9. Kernel 2
10. Kernel 3
11. Map output buffer
12. Copy data out
This sequence works very well on OpenCL on a different (non-Intel) processor but seems to suffer longer start latency than expected on this processor. Examples of the latency (in microseconds) between some of these steps are shown below.
- end 7 (unmap) to start 8 (kernel 1): 700 - 1400
- end 8 (kernel 1) to start 9 (kernel 2): 400 - 900
- end 9 (kernel 2) to start 10 (kernel 3): 400 - 700
- end 10 (kernel 3) to start 11 (map): 300 - 600
These times are huge for our system, which operates on a short real-time cycle.
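For anyone wanting to reproduce this kind of measurement, here is a minimal sketch of one way end-to-start gaps could be derived. It assumes each command's start and end timestamps are available in nanoseconds, for example from clGetEventProfilingInfo with CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END on a queue created with profiling enabled. The cmd_times struct and both helper functions are hypothetical names used for illustration only, and the block is plain C so it does not depend on the OpenCL headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record: device timestamps (ns) for one queued command.
 * Assumed to be filled from OpenCL event profiling info. */
typedef struct {
    uint64_t start_ns;
    uint64_t end_ns;
} cmd_times;

/* Gap in microseconds between the end of `prev` and the start of `next`.
 * Returns 0 if the commands overlap or the timestamps are out of order. */
static double gap_us(const cmd_times *prev, const cmd_times *next)
{
    if (next->start_ns <= prev->end_ns)
        return 0.0;
    return (double)(next->start_ns - prev->end_ns) / 1000.0;
}

/* For a chain of n commands in execution order, store the gap between
 * cmds[i] and cmds[i + 1] in gaps[i] (n - 1 entries) and return the
 * largest gap found. */
static double chain_gaps_us(const cmd_times *cmds, size_t n, double *gaps)
{
    assert(n >= 2);
    double worst = 0.0;
    for (size_t i = 0; i + 1 < n; ++i) {
        gaps[i] = gap_us(&cmds[i], &cmds[i + 1]);
        if (gaps[i] > worst)
            worst = gaps[i];
    }
    return worst;
}
```

Passing the records for the unmap, the three kernels, and the output map in that order would give the four gaps listed above, plus the worst case for the cycle.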
Does anyone have some insight into what might be causing this and how we could reduce the times? Some specifics of the system are given below in case they might help.
Thanks, Tony
Linux: Yocto from the Apollo Lake BSP release gold, build core-image-sato-sdk, installed on onboard eMMC.
Hardware: Oxbow Hill Rev B CRB with Intel Atom E3950 and 8GB DDR3 RAM
OpenCL: installed user space drivers from SRB4 https://software.intel.com/file/533571/download