Hello,
I am currently comparing my own implementation of GEMV in OpenCL to the MKL. I am benchmarking very small input sizes, for example 2x64. On my system the MKL takes around 0.001 ms for this input size, and my kernel takes around 0.003 ms.
When executing a completely empty kernel, I get a runtime of around 0.0025 ms. Where does this overhead come from, and why doesn't the MKL seem to have it? I am timing my OpenCL kernel via OpenCL events and the MKL with the dsecnd() function that is supplied by the MKL.
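Since the timing relies on OpenCL events, it may help to spell out which interval is actually being reported. An event exposes four profiling counters in nanoseconds (CL_PROFILING_COMMAND_QUEUED, CL_PROFILING_COMMAND_SUBMIT, CL_PROFILING_COMMAND_START, CL_PROFILING_COMMAND_END). The Python sketch below only illustrates the arithmetic between them. The helper name `event_intervals_ms` is hypothetical and the timestamps are made up for illustration; it is not the actual benchmark code and the numbers are not measurements.

```python
# Hypothetical helper: split the four OpenCL event profiling counters
# (all in nanoseconds) into the intervals they delimit, in milliseconds.
def event_intervals_ms(queued_ns, submit_ns, start_ns, end_ns):
    ns_per_ms = 1_000_000
    return {
        # Host enqueued the command until it was submitted to the device.
        "queued_to_submit": (submit_ns - queued_ns) / ns_per_ms,
        # Submitted to the device until execution actually started.
        "submit_to_start": (start_ns - submit_ns) / ns_per_ms,
        # Execution time as reported by the runtime (END - START).
        "start_to_end": (end_ns - start_ns) / ns_per_ms,
        # Everything from enqueue to completion.
        "queued_to_end": (end_ns - queued_ns) / ns_per_ms,
    }


# Made-up timestamps for illustration only, not measured values.
intervals = event_intervals_ms(0, 400, 1_000, 3_000)
for name, ms in intervals.items():
    print(f"{name}: {ms:.4f} ms")
```

Comparing `start_to_end` with `queued_to_end` for the empty kernel would show whether the fixed cost sits inside the START to END window or in the queueing and submission steps before it.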
Thanks in advance!