Hi all,
I am trying to optimize a matrix-vector multiplication kernel for an Intel CPU+GPU system. I know that gemv/BLAS-2 is memory-bound, but I want to get the best performance possible. Here's the code for the kernel:
__kernel void gemv(const __global float4* M,
                   const __global float4* V,
                   uint width, uint height,
                   __global float* W,
                   __local float* partialDotProduct)
{
    // Each work-group handles as many matrix rows as necessary
    for (uint y = get_group_id(0); y < height; y += get_num_groups(0)) {
        // Row pointer (width is in floats, so width/4 float4 elements per row)
        const __global float4* row = M + y * (width / 4);

        // Each work-item accumulates as many products as necessary
        // into local variable "sum"
        float4 sum = (float4)(0.0f);
        for (uint x = get_local_id(0); x < width / 4; x += get_local_size(0))
            sum = fma(row[x], V[x], sum);

        // Each partial dot product is stored in SLM
        partialDotProduct[get_local_id(0)] = dot(sum, (float4)(1.0f));

        // Perform parallel reduction to add each work-item's
        // partial dot product together
        for (uint stride = get_local_size(0) / 2; stride > 0; stride /= 2) {
            // Synchronize to make sure each work-item is done updating SLM
            barrier(CLK_LOCAL_MEM_FENCE);

            // Only the first work-items in the work-group add elements together
            if (get_local_id(0) < stride) {
                // Add two elements from the "partialDotProduct" array
                // and store the result in partialDotProduct[index]
                partialDotProduct[get_local_id(0)] +=
                    partialDotProduct[get_local_id(0) + stride];
            }
        }

        // Write the result of the reduction to global memory
        if (get_local_id(0) == 0)
            W[y] = partialDotProduct[0];

        // Synchronize so no work-item overwrites SLM for the next row
        // before the reduction result has been read
        barrier(CLK_LOCAL_MEM_FENCE);
    }
}
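For context, the host side launches it roughly like this (the sizes here are illustrative, not my exact values):

    size_t local  = 64;           /* work-items per work-group (illustrative) */
    size_t global = 256 * local;  /* enough groups to keep the EUs busy; the
                                     kernel loops over rows, so this need not
                                     equal height */
    /* arg 5 is the __local scratch buffer, one float per work-item */
    clSetKernelArg(kernel, 5, local * sizeof(float), NULL);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local,
                           0, NULL, NULL);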
On measuring performance (using the profiling queue), I found that the CPU is faster than the GPU by 10-20% across all data sizes (matrices ranging from 512x512 to 8192x8192).
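For reference, the timing is done with event profiling along these lines (simplified sketch; error checks omitted, and the queue is created with CL_QUEUE_PROFILING_ENABLE):

    cl_event ev;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local,
                           0, NULL, &ev);
    clWaitForEvents(1, &ev);

    cl_ulong start, end;
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, NULL);
    double ms = (end - start) * 1e-6;  /* timestamps are in nanoseconds */
    clReleaseEvent(ev);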
Is there room for any more optimization here, or am I correct in assuming that the performance is bounded by memory accesses?
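My back-of-the-envelope reasoning: each matrix element is read exactly once and used for a single fma, so gemv does about 2 FLOPs per 4 bytes streamed in (~0.5 FLOP/byte), which is why I expect both devices to hit the same shared-memory bandwidth wall. A minimal sanity-check sketch, assuming dual-channel DDR4-2133 (~34 GB/s theoretical peak; substitute a measured STREAM figure for a tighter bound):

    #include <stdio.h>

    /* Rough roofline floor for NxN single-precision gemv.
     * BANDWIDTH_GBS is an assumption, not a measured value. */
    #define BANDWIDTH_GBS 34.0

    int main(void) {
        for (unsigned n = 512; n <= 8192; n *= 2) {
            double bytes = (double)n * n * sizeof(float)  /* matrix, read once */
                         + (double)n * sizeof(float)      /* vector, read once */
                         + (double)n * sizeof(float);     /* result, written   */
            double flops = 2.0 * n * n;                   /* one fma per elem  */
            double t_ms  = bytes / (BANDWIDTH_GBS * 1e9) * 1e3;
            printf("%5u: %.3f FLOP/byte, bandwidth-bound floor ~%.3f ms\n",
                   n, flops / bytes, t_ms);
        }
        return 0;
    }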
I am using the latest OpenCL runtimes on an Intel Core i5-6300U / HD Graphics 520 running Windows.
Thanks