Can someone please straighten me out on expected clEnqueueMapBuffer overheads under Haswell?
Environment: Windows 7 sp1, VS2013, i7-4770, driver 10.18.14.4170
I have my own 768 KB buffer which needs to be accessed by the HD 4600 GPU.
I clCreateBuffer with CL_MEM_ALLOC_HOST_PTR, which I believe sets aside some pinned memory for later use. Later on I call clEnqueueMapBuffer (with CL_MAP_WRITE, since I populate the buffer through the pointer), and use the resultant pointer to fill the cl_mem with my data. The event from the clEnqueueMapBuffer call gates the clEnqueueUnmapMemObject enqueued straight afterwards, and similarly the event from the clEnqueueUnmapMemObject goes in the event_wait_list of the kernel launch straight after that.
Code below, sans error handling.
All fairly straightforward, but the timing is a problem. My VTune trace shows the clEnqueueMapBuffer spending 770 us in the queue before a 0.3 us compute; the clEnqueueUnmapMemObject then takes 130 us, followed by a similarly negligible 0.3 us compute. Since my kernel itself only takes 400 us of compute, the map/unmap queuing is a disproportionate part of the overall time. Am I just down in the noise and fixed overheads at such small command times, or can I improve this at all?
tile_size_bytes = BITMAP_NON_TEXTURED_SIZE_PER_TILE_OPENCL * sizeof(unsigned char);
input_buffer_cl_mem = clCreateBuffer(oclInstance->context, CL_MEM_ALLOC_HOST_PTR,
                                     tile_size_bytes, NULL, &errcode_ret);

// later....

// map the buffer (non-blocking; host writes through the pointer, so CL_MAP_WRITE)
mapped_tile_buffer = clEnqueueMapBuffer(oclInstance->queue, input_buffer_cl_mem,
                                        CL_FALSE, CL_MAP_WRITE,
                                        0, tile_size_bytes,
                                        0, NULL, &writeTileEvent, &errcode_ret);

// the map was non-blocking, so wait for it before touching the pointer
clWaitForEvents(1, &writeTileEvent);

// copy strided data into mapped buffer
thisTileStart = tileStrip + (tileCount * BITMAP_TILE_WIDTH_PIXELS * CHANNEL_COUNT_OPENCL);
destride_incoming_bitmap_tile_in_strip_into_buffer(thisTileStart, bitmap_step,
                                                   (char *)mapped_tile_buffer);

// and unmap
errcode_ret = clEnqueueUnmapMemObject(oclInstance->queue, input_buffer_cl_mem,
                                      mapped_tile_buffer,
                                      1, &writeTileEvent, &unmapEvent);

// (kernel params already set up...)
size_t globalSize[3];
size_t localSize[] = { THREAD_BLOCK_TILE_WIDTH_IN_PIXELS,
                       THREAD_BLOCK_TILE_HEIGHT_IN_PIXELS,
                       1 };                                          // blocks are default 64 x 8
size_t globalSizeWorkgroups[] = {
    ARR_IMAGE_TILE_WIDTH  / THREAD_BLOCK_TILE_WIDTH_IN_PIXELS,      // 256/64 = 4
    ARR_IMAGE_TILE_HEIGHT / THREAD_BLOCK_TILE_HEIGHT_IN_PIXELS,     // 256/8  = 32
    CHANNEL_COUNT_OPENCL };                                          // 3
globalSize[0] = globalSizeWorkgroups[0] * localSize[0];
globalSize[1] = globalSizeWorkgroups[1] * localSize[1];
globalSize[2] = globalSizeWorkgroups[2] * localSize[2];

errcode_ret = clEnqueueNDRangeKernel(oclInstance->queue,                // command queue
                                     acej_kernel_tile_ifdct_cpuhuffman, // kernel
                                     3,                                 // work_dim
                                     NULL,                              // global_work_offset
                                     globalSize,                        // global_work_size
                                     localSize,                         // local_work_size : localSize or NULL
                                     1,                                 // num_events_in_wait_list
                                     &unmapEvent,                       // event_wait_list
                                     (eventList + tileCount));          // event