Please take a look at my OpenCL 2.0 tutorial on using enqueue_kernel and the work-group scan functions. It also covers a very cool algorithm, GPU-Quicksort, which I implemented in both OpenCL 1.2 and OpenCL 2.0.
https://software.intel.com/en-us/articles/gpu-quicksort-in-opencl-20-usi...
Let me know what you think!