Hello,
I have some theoretical questions that should help me better understand the Intel FPGA OpenCL compiler. First, I still don't know when to prefer an NDRange kernel over a single-task kernel. As I understand it, a single-task kernel can be data-parallelized more flexibly with the unroll pragma, because we can mark exactly the loops we want vectorized. An NDRange kernel, on the other hand, offers the num_simd_work_items attribute, which is restricted to powers of 2 (why?) and requires the programmer to fix the work-group size. The NDRange concept fits other OpenCL platforms well because their fixed hardware consists of multiple compute units, but I cannot see why it is needed on an FPGA.
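To make the comparison concrete, here is a minimal sketch of the same vector add written both ways. The kernel names, the work-group size of 64 and the factor of 4 are my own placeholders; the attribute and pragma spellings are the ones I know from the Intel FPGA SDK for OpenCL. The #ifndef block is only a host-side shim I added so the file also builds with an ordinary C compiler for a quick functional check; it is an assumption of mine and not part of the SDK.

```c
#ifndef __OPENCL_VERSION__
/* Host-only shim (my assumption, not part of the SDK): lets a plain C
   compiler build this file so both kernels can be checked functionally. */
#include <assert.h>
#include <stddef.h>
#define __kernel
#define __global
#define KERNEL_ATTR(x)
size_t host_gid = 0;  /* hypothetical stand-in for the current work-item id */
static size_t get_global_id(unsigned dim)
{
    assert(dim == 0);  /* this sketch is one-dimensional */
    (void)dim;
    return host_gid;
}
#else
#define KERNEL_ATTR(x) __attribute__((x))
#endif

/* Single-task style: one loop, with an unroll factor of 4 requested. */
__kernel void vadd_single_task(__global const float *restrict a,
                               __global const float *restrict b,
                               __global float *restrict c,
                               int n)
{
#ifdef __OPENCL_VERSION__
    #pragma unroll 4
#endif
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* NDRange style: one work-item per element, SIMD width of 4 requested.
   The SIMD width has to divide the fixed work-group size evenly. */
KERNEL_ATTR(reqd_work_group_size(64, 1, 1))
KERNEL_ATTR(num_simd_work_items(4))
__kernel void vadd_ndrange(__global const float *restrict a,
                           __global const float *restrict b,
                           __global float *restrict c)
{
    size_t gid = get_global_id(0);
    c[gid] = a[gid] + b[gid];
}
```

On the host, the first kernel would be launched as a single work-item, while the second needs a global work size that is a multiple of 64.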
Second, I would like to know when to prefer multiple compute units over SIMD. According to the Best Practices Guide, finding the best combination of compute units and SIMD width takes some experimentation. But I cannot think of a scenario in which n compute units, which get no memory coalescing, would outperform a single compute unit with a SIMD width of n. It seems to me that it is always better to reduce the number of compute units by a factor of n and increase the SIMD width by the same factor, as long as there are enough resources. If that is the case, what justifies the existence of the num_compute_units attribute?
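To show the design space I am choosing from, this small host-side C sketch lists every split of a fixed amount of parallelism between compute units and SIMD width. The struct and function names are hypothetical helpers of mine. The sketch assumes the SIMD width is limited to 1, 2, 4, 8 or 16 and must evenly divide the required work-group size, which is how I read the documentation.

```c
#include <assert.h>
#include <stddef.h>

/* One candidate configuration: total lanes = compute_units * simd. */
struct split {
    unsigned compute_units;
    unsigned simd;
};

/* Fill `out` with every (compute_units, simd) pair whose product is `lanes`.
   Assumption: simd may only be 1, 2, 4, 8 or 16 and must divide wg_size.
   Returns the number of pairs written, never more than `cap`. */
size_t enumerate_splits(unsigned lanes, unsigned wg_size,
                        struct split *out, size_t cap)
{
    size_t count = 0;
    assert(lanes > 0 && wg_size > 0);
    for (unsigned simd = 1; simd <= 16; simd *= 2) {
        if (lanes % simd != 0 || wg_size % simd != 0)
            continue;  /* this width cannot be used for this budget */
        if (count < cap) {
            out[count].compute_units = lanes / simd;
            out[count].simd = simd;
            count++;
        }
    }
    return count;
}
```

For 8 lanes and a work-group size of 64 this yields four candidates, from 8 compute units without SIMD down to 1 compute unit with a SIMD width of 8. My question is whether anything other than the last one is ever the right choice.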
Lastly, after optimizing the number of compute units and the SIMD width, what procedure should we follow to find the best work-group size? The Best Practices Guide states that each work group runs on only one compute unit. Does that mean the number of work groups should be a multiple of the number of compute units for best performance? So far I have always aimed for the smallest possible reqd_work_group_size value, so that choosing the global work size is easier (on my device it has to be a multiple of the work-group size). Is there a more elegant way to choose the work-group size?
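For reference, this is the host-side arithmetic I currently use when picking sizes, written as a small C sketch. The function names are hypothetical helpers of mine, not OpenCL API calls, and the even-division check only encodes my guess about compute units, not anything the guide states.

```c
#include <assert.h>
#include <stddef.h>

/* Round the problem size up to the next multiple of the work-group size,
   because the global work size must be a multiple of it on my device. */
size_t padded_global_size(size_t n_items, size_t wg_size)
{
    assert(wg_size > 0);
    return (n_items + wg_size - 1) / wg_size * wg_size;
}

/* Number of work groups that a given global size is split into. */
size_t work_group_count(size_t global_size, size_t wg_size)
{
    assert(wg_size > 0 && global_size % wg_size == 0);
    return global_size / wg_size;
}

/* My guess, not a documented rule: returns 1 when the work groups can be
   shared out evenly across the compute units, 0 otherwise. */
int groups_divide_evenly(size_t n_groups, size_t compute_units)
{
    assert(compute_units > 0);
    return n_groups % compute_units == 0;
}
```

With 1000 items and a work-group size of 64, the global size is padded to 1024, which gives 16 work groups; those divide evenly over 4 compute units but not over 3. The padded work-items then need either a bounds check in the kernel or padded buffers.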
Regards,
Gorkem