Hi all,
I am trying to optimize a matrix-vector multiplication kernel for an Intel CPU+GPU system. I know that gemv/BLAS-2 is memory-bound, but I want to get the best performance possible. Here's the code for the kernel:
__kernel void gemv(const __global float4* M,
                   const __global float4* V,
                   uint width, uint height,
                   __global float* W,
                   __local float* partialDotProduct)
{
    // Each work-group handles as many matrix rows as necessary
    for (uint y = get_group_id(0); y < height; y += get_num_groups(0)) {
        // Row pointer (width is in floats, so width/4 float4 elements per row)
        const __global float4* row = M + y * (width / 4);

        // Each work-item accumulates as many products as necessary
        // into local variable "sum"
        float4 sum = (float4)(0.0f);
        for (uint x = get_local_id(0); x < width / 4; x += get_local_size(0))
            sum = fma(row[x], V[x], sum);

        // Each partial dot product is stored in SLM
        partialDotProduct[get_local_id(0)] = dot(sum, (float4)(1.0f));

        // Perform parallel reduction to add each work-item's
        // partial dot product together
        for (uint stride = get_local_size(0) / 2; stride > 0; stride /= 2) {
            // Synchronize to make sure each work-item is done updating SLM
            barrier(CLK_LOCAL_MEM_FENCE);

            // Only the first work-items in the work-group add elements together
            if (get_local_id(0) < stride) {
                // Add two elements from the "partialDotProduct" array
                // and store the result in partialDotProduct[index]
                partialDotProduct[get_local_id(0)] +=
                    partialDotProduct[get_local_id(0) + stride];
            }
        }

        // Write the result of the reduction to global memory
        if (get_local_id(0) == 0)
            W[y] = partialDotProduct[0];

        // Synchronize so no work-item overwrites SLM for the next row
        // before the reduction result has been read
        barrier(CLK_LOCAL_MEM_FENCE);
    }
}
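For context, the host side launches it roughly like this (the sizes here are illustrative, not my exact values):

    size_t local  = 64;           /* work-items per work-group (illustrative) */
    size_t global = 256 * local;  /* enough groups to keep the EUs busy; the
                                     kernel loops over rows, so this need not
                                     equal height */
    /* arg 5 is the __local scratch buffer, one float per work-item */
    clSetKernelArg(kernel, 5, local * sizeof(float), NULL);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local,
                           0, NULL, NULL);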
On measuring performance (using the profiling queue), I found that the CPU is faster than the GPU by 10-20% across all data sizes (matrices ranging from 512x512 to 8192x8192).
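For reference, the timing is done with event profiling along these lines (simplified sketch; error checks omitted, and the queue is created with CL_QUEUE_PROFILING_ENABLE):

    cl_event ev;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local,
                           0, NULL, &ev);
    clWaitForEvents(1, &ev);

    cl_ulong start, end;
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, NULL);
    double ms = (end - start) * 1e-6;  /* timestamps are in nanoseconds */
    clReleaseEvent(ev);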
Is there room for any more optimization here, or am I correct in assuming that the performance is bounded by memory accesses?
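My back-of-the-envelope reasoning: each matrix element is read exactly once and used for a single fma, so gemv does about 2 FLOPs per 4 bytes streamed in (~0.5 FLOP/byte), which is why I expect both devices to hit the same shared-memory bandwidth wall. A minimal sanity-check sketch, assuming dual-channel DDR4-2133 (~34 GB/s theoretical peak; substitute a measured STREAM figure for a tighter bound):

    #include <stdio.h>

    /* Rough roofline floor for NxN single-precision gemv.
     * BANDWIDTH_GBS is an assumption, not a measured value. */
    #define BANDWIDTH_GBS 34.0

    int main(void) {
        for (unsigned n = 512; n <= 8192; n *= 2) {
            double bytes = (double)n * n * sizeof(float)  /* matrix, read once */
                         + (double)n * sizeof(float)      /* vector, read once */
                         + (double)n * sizeof(float);     /* result, written   */
            double flops = 2.0 * n * n;                   /* one fma per elem  */
            double t_ms  = bytes / (BANDWIDTH_GBS * 1e9) * 1e3;
            printf("%5u: %.3f FLOP/byte, bandwidth-bound floor ~%.3f ms\n",
                   n, flops / bytes, t_ms);
        }
        return 0;
    }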
I am using the latest OpenCL runtimes on an Intel Core i5-6300U / HD Graphics 520 running Windows.
Thanks