Channel: Intel® Software - OpenCL*

Optimizing a Matrix-Vector multiplication kernel


Hi all,

I am trying to optimize a matrix-vector multiplication kernel for an Intel CPU-GPU system. I know that GEMV (BLAS level 2) is memory bound, but I want to get the best performance possible. Here is the kernel code:

__kernel void gemv(const __global float4* M,
	const __global float4* V,
	uint width, uint height,
	__global float* W,
	__local float* partialDotProduct)
{
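	// Each work-group handles as many matrix rows as necessary.
	// Assumes width is a multiple of 4 and the work-group size
	// is a power of two (required by the reduction below).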
	for (uint y = get_group_id(0); y < height; y += get_num_groups(0)) {

		// Row pointer
		const __global float4* row = M + (y * width/4);

		// Each work-item accumulates as many products as necessary
		// into local variable "sum"
		float4 sum = (float4) (0.0f);

		for (uint x = get_local_id(0); x < width/4; x += get_local_size(0))
			sum = fma(row[x],V[x],sum);


		// Each partial dot product is stored in SLM
		partialDotProduct[get_local_id(0)] = dot(sum, (float4) 1.0f);

		// Perform parallel reduction to add each work-item's
		// partial dot product together

		for (uint stride = get_local_size(0) / 2; stride > 0; stride /= 2) {

			// Synchronize to make sure each work-item is done updating SLM
			barrier(CLK_LOCAL_MEM_FENCE);

			// Only the first work-items in the work-group add elements together
			if (get_local_id(0) < stride) {

				// Add two elements from the "partialDotProduct" array
				// and store the result in partialDotProduct[index]
				partialDotProduct[get_local_id(0)] += partialDotProduct[get_local_id(0) + stride];
			}
		}

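		// Write the result of the reduction to global memory
		if (get_local_id(0) == 0)
			W[y] = partialDotProduct[0];

		// Make sure the reduction has finished reading SLM before any
		// work-item overwrites its slot for the next row
		barrier(CLK_LOCAL_MEM_FENCE);
	}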
}
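To check that the kernel's output W is correct before timing it, a plain host-side reference can be compared against the result read back from the device. The following is only a minimal sketch: `gemv_reference` and `gemv_matches` are illustrative names that do not come from my host code, and it assumes M is stored row-major as `height` rows of `width` floats, which is the layout the kernel's row pointer implies.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host-side reference: W = M * V for a row-major
 * height x width matrix. Accumulates in double so the reference
 * is less affected by rounding than the float kernel. */
void gemv_reference(const float *M, const float *V,
                    size_t width, size_t height, float *W)
{
    assert(M != NULL && V != NULL && W != NULL);
    for (size_t y = 0; y < height; ++y) {
        double sum = 0.0;
        for (size_t x = 0; x < width; ++x)
            sum += (double)M[y * width + x] * (double)V[x];
        W[y] = (float)sum;
    }
}

/* Returns 1 if every element of `got` is within `rel_tol` of
 * `expected`, relative to max(1, |expected[i]|); returns 0 otherwise. */
int gemv_matches(const float *expected, const float *got,
                 size_t n, float rel_tol)
{
    for (size_t i = 0; i < n; ++i) {
        float diff = expected[i] - got[i];
        if (diff < 0.0f)
            diff = -diff;
        float scale = expected[i] < 0.0f ? -expected[i] : expected[i];
        if (scale < 1.0f)
            scale = 1.0f;
        if (diff > rel_tol * scale)
            return 0;
    }
    return 1;
}
```

A relative tolerance is used because the kernel sums in a different order (per work-item partial sums followed by a tree reduction), so bit-exact agreement with a sequential sum should not be expected.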

Measuring with a profiling-enabled command queue, I found that the CPU is 10-20% faster than the GPU for all data sizes (matrices ranging from 512x512 to 8192x8192).

Is there room for further optimization here? Or am I correct in assuming that performance is bounded by memory accesses?
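One way to test the memory-bound assumption is to turn each profiled kernel time into an effective bandwidth and compare it with the platform's peak memory bandwidth. The sketch below is only a rough model: the function names are illustrative, and it assumes every element of M and V is read from memory exactly once and W is written once. The elapsed time would be the difference between the `CL_PROFILING_COMMAND_END` and `CL_PROFILING_COMMAND_START` values returned by `clGetEventProfilingInfo`, which are in nanoseconds.

```c
#include <assert.h>
#include <stdint.h>

/* Minimum bytes a GEMV has to move: read M (height * width floats),
 * read V (width floats), write W (height floats). Assumes each
 * element is transferred exactly once. */
double gemv_min_bytes(uint64_t width, uint64_t height)
{
    return (double)(width * height + width + height) * sizeof(float);
}

/* One multiply and one add per matrix element. */
double gemv_flops(uint64_t width, uint64_t height)
{
    return 2.0 * (double)width * (double)height;
}

/* Effective bandwidth in GB/s (1 GB = 1e9 bytes).
 * Bytes per nanosecond is numerically equal to GB/s. */
double gemv_effective_gbps(uint64_t width, uint64_t height,
                           uint64_t elapsed_ns)
{
    assert(elapsed_ns > 0);
    return gemv_min_bytes(width, height) / (double)elapsed_ns;
}

/* Arithmetic intensity in FLOPs per byte moved. */
double gemv_arithmetic_intensity(uint64_t width, uint64_t height)
{
    return gemv_flops(width, height) / gemv_min_bytes(width, height);
}
```

For single-precision GEMV the intensity stays just under 0.5 FLOPs per byte regardless of matrix size. If the effective bandwidth is already close to what the memory system can deliver, there is little left to gain from tuning the arithmetic in the kernel.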

I am using the latest OpenCL runtimes on an Intel Core i5-6300U with HD Graphics 520, running Windows.

Thanks

Thread Topic: Question
