We are evaluating GPU solutions to accelerate our algorithms, which have unavoidable sequential parts at the core.
Because of data dependencies, we would need to send up to 300,000 commands per second to the GPU, which introduces considerable slowdown.
It seems NVIDIA exposes "dynamic parallelism" through CUDA, which looks helpful for reducing this command-submission overhead. OpenCL 2.0 will offer something similar under the name "device-side enqueue", which will allow a kernel to launch child kernels directly from the device.
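To make concrete what we have in mind, here is a minimal CUDA sketch of the pattern, not our actual code. The kernel names `step_kernel` and `driver_kernel`, the per-element update, and the sizes are all hypothetical placeholders. It assumes a device of compute capability 3.5 or higher and a build with relocatable device code (for example `nvcc -rdc=true file.cu -lcudadevrt` with a suitable `-arch`). A single parent thread enqueues the dependent steps on the device, so the host submits one command instead of one per step.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical data-parallel step: each element is incremented by one.
__global__ void step_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] += 1.0f;
    }
}

// Parent kernel: one thread launches every dependent step from the device.
__global__ void driver_kernel(float *data, int n, int steps)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        const int threads = 256;
        const int blocks = (n + threads - 1) / threads;
        for (int s = 0; s < steps; ++s) {
            // Child grids launched by the same thread into the same stream
            // execute in launch order, so step s+1 sees the result of step s.
            step_kernel<<<blocks, threads>>>(data, n);
        }
    }
}

int main()
{
    const int n = 1 << 16;   // assumed problem size
    const int steps = 100;   // assumed number of dependent steps

    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // A single host-side submission; the parent grid is not considered
    // complete until all of its child grids have finished.
    driver_kernel<<<1, 1>>>(d_data, n, steps);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        cudaFree(d_data);
        return 1;
    }

    float first = 0.0f;
    cudaMemcpy(&first, d_data, sizeof(float), cudaMemcpyDeviceToHost);
    std::printf("data[0] = %f\n", first);

    cudaFree(d_data);
    return 0;
}
```

The number of child launches that can be pending at once is limited by the device runtime, so a real version would probably have to issue the steps in batches. We would like to express the same structure with OpenCL 2.0 device-side enqueue.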
When will Intel release OpenCL drivers that support this kind of "dynamic parallelism" (OpenCL 2.0 device-side enqueue)?