Channel: Intel® Software - OpenCL*

clEnqueueMap/UnmapBuffer overheads


Can someone please straighten me out on expected clEnqueueMapBuffer overheads under Haswell?

Environment: Windows 7 SP1, VS2013, i7-4770, driver 10.18.14.4170

I have my own 768 KB buffer which needs to be accessed by the HD 4600 GPU.
I create the buffer with clCreateBuffer and CL_MEM_ALLOC_HOST_PTR, which I believe sets aside some pinned memory for later use. Later on I call clEnqueueMapBuffer (with CL_MAP_READ) and use the resulting pointer to populate the new cl_mem with my data. An event from the clEnqueueMapBuffer call is used to kick off a clEnqueueUnmapMemObject straight afterwards, and similarly, an event from the clEnqueueUnmapMemObject is placed in the event_wait_list of the kernel launch straight after that.
Code below, sans error handling.

All fairly straightforward, but timing is a problem. My VTune trace shows the clEnqueueMapBuffer spending 770 us in the queue before a 0.3 us compute. The clEnqueueUnmapMemObject then takes 130 us, followed by a similarly negligible 0.3 us compute time. However, since my kernel takes only 400 us of compute, the clEnqueueMapBuffer/clEnqueueUnmapMemObject queuing accounts for a disproportionate part of the overall time. Am I just in the noise and overheads with such small function times, or can I improve this at all?
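To put those figures in proportion, here is a rough sketch of the arithmetic. It assumes the map, unmap and kernel steps run strictly one after another with no overlap, which is my reading of the trace rather than something I have verified. The function name is made up for illustration.

```c
#include <stddef.h>

/* Hypothetical helper: fraction of the per-tile time spent on map/unmap
 * rather than in the kernel, assuming the three steps are fully serialized. */
static double transfer_overhead_fraction(double map_us, double unmap_us,
                                         double kernel_us)
{
    double overhead_us = map_us + unmap_us;       /* queue + compute of both calls */
    return overhead_us / (overhead_us + kernel_us);
}
```

Calling `transfer_overhead_fraction(770.3, 130.3, 400.0)` with the numbers from the trace gives roughly 0.69, i.e. about two thirds of each tile's time would be going to map/unmap under that assumption.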

 

		tile_size_bytes = BITMAP_NON_TEXTURED_SIZE_PER_TILE_OPENCL * sizeof(unsigned char);

		input_buffer_cl_mem = clCreateBuffer(
			oclInstance->context,
			CL_MEM_ALLOC_HOST_PTR,
			tile_size_bytes,
			NULL,
			&errcode_ret);


		// later....

        // map the buffer

        mapped_tile_buffer = clEnqueueMapBuffer(
            oclInstance->queue,
            input_buffer_cl_mem,
            CL_FALSE,
            CL_MAP_READ,
            0,
            tile_size_bytes,
            0,
            NULL,&writeTileEvent,&errcode_ret);

        // copy strided data into mapped buffer

        thisTileStart = tileStrip + (tileCount * BITMAP_TILE_WIDTH_PIXELS * CHANNEL_COUNT_OPENCL);
        destride_incoming_bitmap_tile_in_strip_into_buffer(thisTileStart, bitmap_step, (char *)mapped_tile_buffer);

        // and unmap

        errcode_ret = clEnqueueUnmapMemObject(
            oclInstance->queue,
            input_buffer_cl_mem,
            mapped_tile_buffer,
            1,&writeTileEvent,&unmapEvent);


		// (kernel params already set up...)

		size_t globalSize[3];
		size_t localSize[] = { THREAD_BLOCK_TILE_WIDTH_IN_PIXELS, THREAD_BLOCK_TILE_HEIGHT_IN_PIXELS, 1 };  // blocks are default 64 x 8
		size_t globalSizeWorkgroups[] = { ARR_IMAGE_TILE_WIDTH / THREAD_BLOCK_TILE_WIDTH_IN_PIXELS,			// 256/64 = 4
			ARR_IMAGE_TILE_HEIGHT / THREAD_BLOCK_TILE_HEIGHT_IN_PIXELS,										// 256/8 = 32
			CHANNEL_COUNT_OPENCL };																			// 3

		globalSize[0] = globalSizeWorkgroups[0] * localSize[0];
		globalSize[1] = globalSizeWorkgroups[1] * localSize[1];
		globalSize[2] = globalSizeWorkgroups[2] * localSize[2];

		errcode_ret = clEnqueueNDRangeKernel(
			oclInstance->queue,                    // command queue
			acej_kernel_tile_ifdct_cpuhuffman,     // kernel
			3,                                     // work_dim
			0,                                     // global_work_offset
			globalSize,                            // global_work_size
			localSize,                             // local_work_size: localSize or NULL
			1,                                     // num_events_in_wait_list
			&unmapEvent,                           // event_wait_list
			(eventList + tileCount));              // event
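The de-striding helper called in the listing above is not shown. As a minimal sketch of the kind of copy it performs, assuming it simply packs fixed-width rows from a strided bitmap into the contiguous mapped buffer, something like the following would do. The name `destride_tile_sketch` and its `row_bytes` and `rows` parameters are hypothetical; the real helper takes only the source pointer, the step and the destination.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the de-striding helper: copies `rows` rows of
 * `row_bytes` bytes each from a strided source into a tightly packed
 * destination. `src_step` is the byte distance between the starts of
 * consecutive source rows and is assumed to be >= row_bytes. */
static void destride_tile_sketch(const unsigned char *src, size_t src_step,
                                 unsigned char *dst, size_t row_bytes,
                                 size_t rows)
{
    for (size_t r = 0; r < rows; ++r) {
        /* Source advances by the full stride, destination by the row width. */
        memcpy(dst + r * row_bytes, src + r * src_step, row_bytes);
    }
}
```

The destination is written front to back exactly once, which is the access pattern I would expect to suit a mapped buffer.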

 

 

 

 

