I am trying to find a way to relax the memory consistency imposed by the OpenCL 2.0 runtime. To clarify my goal, suppose you have the following scenario:
- You have a fine-grained SVM memory object that is written by the CPU and the GPU at the same time.
- You have a method that launches one or more kernels on the GPU. Let's call this method launch_kernels. All kernels launched by launch_kernels manipulate the SVM object.
- You have another CPU method that also does some processing on the data of the SVM object. Let's call it cpu_process.
- All GPU kernels AND the cpu_process method compute different, non-overlapping regions of the SVM object.
So you can imagine a code scenario like this:
    void* svm_obj;
    allocate_svm_object(&svm_obj);
    launch_kernels(svm_obj); // will return immediately without waiting for kernels to finish
    cpu_process(svm_obj);
    sync_gpu(); // wait for prev launched kernels to finish
Here is my situation:
- When launch_kernels(svm_obj) is called on its own (i.e. with the cpu_process(svm_obj); line above removed), it takes about 5 ms.
- When cpu_process(svm_obj) is called on its own (i.e. with the launch_kernels(svm_obj); and sync_gpu(); lines above removed), it also takes about 5 ms.
- When they run in parallel (i.e. the exact scenario above), each one takes about 3 ms longer, for a total of about 8 ms each.
I suppose this additional overhead is added by the OpenCL runtime to guarantee consistency of the SVM memory object. However, in my case, I can guarantee consistency without the runtime's help because no memory location is written by more than one execution unit.
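To make "no memory location is written by more than one execution unit" concrete, here is a minimal sketch of the kind of split I mean. It is only an illustration, not my real code: region_t, split_buffer, regions_overlap and the 64-byte ASSUMED_LINE_SIZE are hypothetical names and values introduced for this example, and rounding the boundary to that granularity is an assumption, not something OpenCL requires.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical half-open byte range [begin, end) inside the SVM buffer. */
typedef struct {
    size_t begin;
    size_t end;
} region_t;

/* Assumed granularity for the boundary; real hardware may differ. */
#define ASSUMED_LINE_SIZE 64u

/*
 * Split a buffer of `total` bytes into a GPU region and a CPU region.
 * gpu_share_percent (0..100) is the rough fraction given to the GPU.
 * The boundary is rounded down to a multiple of ASSUMED_LINE_SIZE so the
 * two regions do not meet in the middle of an (assumed) cache line.
 */
static void split_buffer(size_t total, unsigned gpu_share_percent,
                         region_t *gpu, region_t *cpu)
{
    assert(gpu_share_percent <= 100u);

    size_t boundary = (total * gpu_share_percent) / 100u;
    boundary -= boundary % ASSUMED_LINE_SIZE;

    gpu->begin = 0;
    gpu->end = boundary;   /* kernels write [0, boundary) */
    cpu->begin = boundary;
    cpu->end = total;      /* cpu_process writes [boundary, total) */
}

/* Returns 1 if the two half-open ranges share at least one byte, else 0. */
static int regions_overlap(region_t a, region_t b)
{
    return a.begin < b.end && b.begin < a.end;
}
```

With a split like this, launch_kernels would only touch the GPU region and cpu_process only the CPU region, so the two sides never write the same byte.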
My question is: is there a way to relax the memory consistency of OpenCL 2.0 so that I can remove this additional overhead?