Channel: Intel® Software - OpenCL*

clEnqueueMap/UnmapBuffer overheads

Can someone please straighten me out on expected clEnqueueMapBuffer overheads under Haswell?

Environment: Windows 7 sp1, VS2013, i7-4770, driver 10.18.14.4170

I have my own 768 KB buffer, which needs to be accessed by the HD 4600 GPU.
I clCreateBuffer with CL_MEM_ALLOC_HOST_PTR, which I believe sets aside some pinned memory for later use. Later on I use clEnqueueMapBuffer (with CL_MAP_READ), and use the resultant pointer to populate the new cl_mem with my data. An event from the clEnqueueMapBuffer call is used to kick off a clEnqueueUnmapMemObject straight afterwards, and similarly, an event from the clEnqueueUnmapMemObject is used in the event_wait_list of the kernel launch straight after that.
Code below, sans error handling.

All fairly straightforward, but timing is a problem. My VTune trace shows the clEnqueueMapBuffer taking 770us in the queue before a 0.3us compute. Then the clEnqueueUnmapMemObject takes 130us followed by a similarly negligible 0.3us compute time. However, since my kernel takes only 400us of compute, clEnqueueMap/UnmapBuffer queuing is taking a disproportionate part of the overall time. Am I just in the noise and overheads with such small function times, or can I improve this at all?

 

		tile_size_bytes = BITMAP_NON_TEXTURED_SIZE_PER_TILE_OPENCL * sizeof(unsigned char);

		input_buffer_cl_mem =
		clCreateBuffer(oclInstance->context,
			CL_MEM_ALLOC_HOST_PTR,
			tile_size_bytes,
			NULL,&errcode_ret);


		// later....

        // map the buffer

        mapped_tile_buffer = clEnqueueMapBuffer(
            oclInstance->queue,
            input_buffer_cl_mem,
            CL_FALSE,
            CL_MAP_READ,
            0,
            tile_size_bytes,
            0,
            NULL,&writeTileEvent,&errcode_ret);

        // copy strided data into mapped buffer

        thisTileStart = tileStrip + (tileCount * BITMAP_TILE_WIDTH_PIXELS * CHANNEL_COUNT_OPENCL);
        destride_incoming_bitmap_tile_in_strip_into_buffer(thisTileStart, bitmap_step, (char *)mapped_tile_buffer);

        // and unmap

        errcode_ret = clEnqueueUnmapMemObject(
            oclInstance->queue,
            input_buffer_cl_mem,
            mapped_tile_buffer,
            1,&writeTileEvent,&unmapEvent);


		// (kernel params already set up...

		size_t globalSize[3];
		size_t localSize[] = { THREAD_BLOCK_TILE_WIDTH_IN_PIXELS, THREAD_BLOCK_TILE_HEIGHT_IN_PIXELS, 1 };  // blocks are default 64 x 8
		size_t globalSizeWorkgroups[] = { ARR_IMAGE_TILE_WIDTH / THREAD_BLOCK_TILE_WIDTH_IN_PIXELS,			// 256/64 = 4
			ARR_IMAGE_TILE_HEIGHT / THREAD_BLOCK_TILE_HEIGHT_IN_PIXELS,										// 256/8 = 32
			CHANNEL_COUNT_OPENCL };																			// 3

		globalSize[0] = globalSizeWorkgroups[0] * localSize[0];
		globalSize[1] = globalSizeWorkgroups[1] * localSize[1];
		globalSize[2] = globalSizeWorkgroups[2] * localSize[2];

		errcode_ret = clEnqueueNDRangeKernel(
		oclInstance->queue,								                            // command queue
		acej_kernel_tile_ifdct_cpuhuffman,											// kernel
		3,														                    // work_dim
		0,														                    // global_work_offset
		globalSize,												                    // global_work_size
		localSize,					                                                // local_work_size :  localSize or NULL
		1,														                    // num_events_in_wait_list
		&unmapEvent,													            // event_wait_list
		(eventList + tileCount));											        // event
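
For a sense of scale, the figures quoted above can be put side by side. The sketch below is plain C with no OpenCL calls. The helper names `queue_overhead_fraction` and `round_up_global_size` are hypothetical and not part of the original post. The first computes what share of the per-tile time goes to map/unmap queuing; the second derives a global work size from an extent and a local size, rounding up where the integer division in the post assumes exact divisibility.

```c
#include <stddef.h>

/* Share of per-tile time spent queued in map/unmap (all times in microseconds).
   Hypothetical helper; the inputs are the VTune figures quoted in the post. */
double queue_overhead_fraction(double map_us, double unmap_us, double kernel_us)
{
    double overhead = map_us + unmap_us;
    return overhead / (overhead + kernel_us);
}

/* Smallest multiple of local_size that covers extent.
   Hypothetical helper; local_size must be non-zero. */
size_t round_up_global_size(size_t extent, size_t local_size)
{
    size_t groups = (extent + local_size - 1) / local_size;
    return groups * local_size;
}
```

With the numbers from the post, `queue_overhead_fraction(770.0, 130.0, 400.0)` is 900/1300, roughly 0.69, and `round_up_global_size(256, 64)` stays at 256 because the tile width is already a multiple of the local size.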

 

 

 

 


How to link to OpenCL binary library created with clLinkProgram "-create-library"

Can someone please tell me how to link to the program library created in the following way:

lib = clLinkProgram(context, NULL, NULL, "-create-library ", 1, &prog, NULL, NULL, &err);

I can happily produce this object for our static OpenCL library functions at start up but there seems to be no way to link to it with our dynamically generated kernel program.

I'm not sure exactly how the OpenCL binary library created above is supposed to be used: any further attempt to link it with compiled kernel programs is rejected with CL_LINK_FAILURE.

Any advice much appreciated - thanks!

Which SDK do I need to download ? (

 

Hi, I'm running on an Intel® Core™ i5-480M processor, which has an integrated GPU:

Chip Type: intel(R) HD Graphics (Core i5)

Dac Type: Internal

Adapter String:  Intel(R) HD Graphics

Bios Information: Intel Video Bios

 

I want to start studying and programming with OpenCL.

Which SDK, if any, can I install?

Thanks

 

Precompiling binaries without Intel hardware

Hi,

I have a project using OpenCL code.

As part of the packaging we embed the binaries for the OpenCL platforms we want to support.

Unfortunately, without Intel HD Graphics present on the machine, we have not found a way to precompile the .cl files offline (for instance with clcc, or directly with OpenCL API calls) for the Intel HD Graphics target.

Is that kind of "cross-compilation" supported? Otherwise the product cannot be built entirely on the same machine...

Best Regards

No GPU for OpenCL on P4600?

Greetings

I'm probably just missing something simple, but I'm having an issue with using the GPU with OpenCL under 64-bit Windows 7 on an Intel Z230 with P4600 graphics.  I installed Intel INDE 2015 Update 2, and in Visual Studio 2013, I created an OpenCL application via the CodeBuilder Project for Windows.  I just selected Finish, accepting all default options, which specifies GPU acceleration for OpenCL.  When I run the resulting application, I get the error "clGetDeviceIDs() returned CL_DEVICE_NOT_FOUND" as no GPU device is located.  If I replace the desired device type to be CL_DEVICE_TYPE_CPU instead of CL_DEVICE_TYPE_GPU, then the Add code example runs fine.

According to Device Manager, my graphics driver is 10.18.14.4170. I ran the automatic driver updater, which detected that 15.36.21.64.4222 was current and determined that an update was needed. The update appeared to be successful, but Device Manager still showed me 10.18.14.4170. I uninstalled that driver via the control panel so no explicit driver was loaded, and did another update (with the update app telling me I was now updating from 6.something instead of 10.18.14.4170), but after a reboot the 10.18.14.4170 version reappeared. Is that ultimately the source of my problem, or does driver 15.36.21.64.4222 in Intel numbering appear as 10.18.14.4170 in Device Manager? The release notes from INDE indicate that 15.33.3 or higher is needed for OpenCL, but I'm not sure if that is what I have or not.
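
One way to compare the two version strings is by their final field. This rests on an assumption the post does not confirm: Intel's package numbering (15.36.21.64.4222) and the Windows driver version shown in Device Manager (10.18.14.4170) appear to share the trailing four-digit build number, so a package ending in 4222 would be expected to show up as a version ending in 4222. The helper name `driver_build_number` is hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Returns the numeric field after the last '.' in a dotted version string,
   or -1 if there is no dot, no digits after it, or the field is negative.
   Hypothetical helper; the "shared build number" idea is an assumption. */
long driver_build_number(const char *version)
{
    const char *last_dot = strrchr(version, '.');
    char *end = NULL;
    long build;

    if (last_dot == NULL || last_dot[1] == '\0')
        return -1;

    build = strtol(last_dot + 1, &end, 10);
    if (end == last_dot + 1 || build < 0)   /* nothing parsed, or a sign slipped in */
        return -1;
    return build;
}
```

Here `driver_build_number("10.18.14.4170")` gives 4170 and `driver_build_number("15.36.21.64.4222")` gives 4222. If the assumption holds, the mismatch would suggest the update did not actually take effect.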

Any thoughts would be appreciated - thanks!

Intel OpenCL with an NVidia GPU?

Hi all,

I saw an earlier post saying that you can't run programs on an Nvidia GPU using Intel OpenCL:
https://software.intel.com/en-us/forums/topic/370060

So I just wanted to ask whether this is still true or not. (e.g. I would like to run the General Matrix Multiply Example on my Nvidia GTX 970)

Thanks

Maximum memory for GEN is only 80 MB?

Is the maximum memory size for OpenCL only 80 MB?

First, the CPU allocates the memory using aligned_malloc().
The CPU then does some processing on the memory, after which an OpenCL kernel is executed on the GEN GPU.
Before the OpenCL execution on GEN, clCreateBuffer() is called with the CL_MEM_USE_HOST_PTR flag.
The OpenCL kernel on GEN does some work and writes its results to the memory, and then the CPU postprocesses the memory.

For some reason, if more than around 80 MB is allocated, the application does not execute properly.
Is there an inherent limit on GPU memory? Can't OpenCL on the GEN GPU use more than 80 MB? I checked that it is not an algorithm problem; the memory itself is not working properly.

Could you please let me know whether this inherent limitation exists? If so, how can I bypass it? I think many applications surely use more than 80 MB these days.

Thank you

PS. I'm using an HSW processor on Windows 8.1, with Visual Studio 2012.
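
For the `CL_MEM_USE_HOST_PTR` path described above, one thing worth ruling out is the host allocation itself. Intel's guidance for its integrated GPUs, as far as it can be recalled here, recommends a 4096-byte-aligned pointer and a size that is a multiple of 64 bytes so the buffer can be shared without a copy. Treat both constants as assumptions to check against the current documentation. The sketch below is plain C with hypothetical helper names; it makes no OpenCL calls and does not by itself explain the 80 MB behaviour. The device's actual per-buffer limit can be queried separately with `clGetDeviceInfo` and `CL_DEVICE_MAX_MEM_ALLOC_SIZE`.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed values, taken from Intel's zero-copy guidance as recalled;
   verify against the documentation for your driver. */
#define ASSUMED_PTR_ALIGNMENT 4096u
#define ASSUMED_SIZE_GRANULE  64u

/* Returns 1 if ptr/size meet the assumed zero-copy conditions, 0 otherwise. */
int meets_assumed_zero_copy_rules(const void *ptr, size_t size)
{
    if (ptr == NULL || size == 0)
        return 0;
    if (((uintptr_t)ptr % ASSUMED_PTR_ALIGNMENT) != 0)
        return 0;
    return (size % ASSUMED_SIZE_GRANULE) == 0;
}

/* Rounds a requested size up to the next multiple of the assumed granule. */
size_t round_up_to_granule(size_t size)
{
    return (size + ASSUMED_SIZE_GRANULE - 1) / ASSUMED_SIZE_GRANULE
           * ASSUMED_SIZE_GRANULE;
}
```

A buffer obtained from an aligned allocator with 4096-byte alignment and a size passed through `round_up_to_granule` would pass the check; a pointer that is only 8-byte aligned would not.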

Shared memory vs Texture memory

I am writing deinterlacing code in OpenCL. I am reading the pixels into local memory using the read_imageui() API.

Just like the code at: https://opencl-book-samples.googlecode.com/svn-history/r29/trunk/src/Chapter_19/oclFlow/lkflow.cl

As I understand it, when we read pixels using this API we are reading from texture memory. I doubt that staging the pixels in shared (local) memory first will gain any speed, since texture memory already acts as a cache and provides fast access to the data.

Can anyone clarify this?


Wrong calculation results on Intel(R) HD Graphics 4600

Hi,

I've found what seems to be a bug in the OpenCL driver for Intel HD 4600 graphics core. The code I am attaching produces correct results on Intel CPU or NVidia GPU OpenCL targets but at the same time produces wrong results on HD 4600.

The code is self-diagnosing. It tells you if it produces correct or incorrect results. Here is the output I am getting on HD 4600:

Device: Intel(R) HD Graphics 4600
Expected value at center: 1.06103
Computed value at center: 0.000547921
MSE: 0.0778968 [FAIL]
 

After a small modification to select a CPU device it produces the following:

Device: Intel(R) Core(TM) i7-4700HQ CPU @ 2.40GHz
Expected value at center: 1.06103
Computed value at center: 1.06335
MSE: 7.60111e-005 [PASS]

I am compiling the code with Visual Studio 2013 using Khronos OpenCL headers (.h files from OpenCL 1.1 and cl.hpp from OpenCL 1.2) and linking against opencl.lib from NVidia's SDK. It shouldn't matter though.

I am running Windows 8.1 with the latest driver 10.18.14.4222.

 

 

Attachment: test.cpp (8.13 KB)

Power consumption (Watt)

Hi,

I am looking for a software measurement tool, or an API I can include, to find out my kernel's power consumption while it executes on the GPU. How can I do that?

Thanks

Two similar programs: one finds the GPU, one doesn't

Hi all,

I'm having issues after upgrading the graphics driver from 10.18.10.3621 to 10.18.14.4206 on a Lenovo X240 Haswell machine, running Win8 and Visual Studio 2012 with the matching VC++ 11 compiler, Debug Win32 configuration.

The surprising part is that one OpenCL app has no issues, while another, which loads the first as a DLL and runs the same code that is supposed to find the GPU devices, doesn't find the GPU. Both ran on the same machine, and the second broke after installing the latest gfx driver.

Here's a small example, which is exactly the same in the two applications:
 

    vector<cl_platform_id> platform = getClPlatforms(); // a wrapper for clGetPlatformIDs
    bool foundPlatform = !types.size(); // in case the list is empty we take the first platform

    vector<cl_device_id> devIDFinal;
    for (auto plat : platform)
    {
        cl_char version[256];
        clGetPlatformInfo(plat, CL_PLATFORM_VERSION, sizeof(version), version, NULL);

        bool typeNotFound = false;
        devIDFinal.clear();

        vector<cl_device_id> devID = getClDevices(plat);

 

On both runs, the "OpenCL 1.2" platform version is found, but getClDevices returns a different result for that platform.
     

      I've looked at the libs and dlls being loaded during both runs, and unless I've missed something, all of the headers/libs match.
      

      Any guess?

P.S. I'm not sure if it's related, but one major change between before and after updating the gfx driver is the CPU-only experimental OpenCL 2.0...

Thanks!

Danny.

 

 

Linux: clCreateContext fails on embedded Linux device

Hello everybody,

I'm in the process of evaluating the combination of the Intel Media SDK + OpenCL on Linux (using the latest Release of 'MediaServerStudioEssentials2015R6'). For my tests I'm using two Intel NUC5i3RYK. The first NUC is running CentOS 7.1 as described in the 'Intel Media Server Studio 2015 Getting Started Guide'. The second NUC is running a custom embedded Linux created using http://buildroot.uclibc.org/.

The good news:

The NUC running CentOS 7.1 works perfectly fine! I can use all the Intel Media SDK examples, even those that integrate OpenCL via the plugin mechanism. And the OpenCL examples are running on the GPU rather than CPU, which is perfect!

The bad news:

Our custom embedded Linux is based on the requirements mentioned within the Getting Started Guide, i.e. we are using Kernel 3.14.5 and applied all patches from the Media Server Studio package. We are also using libva and libdrm from the Media Server Studio package. We have been using this setup for quite some time now, and the Intel Media SDK (used for Video Transcoding) is working fine!

But for some reason OpenCL won't work, more precisely calling the clCreateContext function fails with error code -5 (CL_OUT_OF_RESOURCES).

...

// Code from opencl_filter_va.cpp provided by the Media Samples
cl_context_properties props[] = { CL_CONTEXT_VA_API_DISPLAY_INTEL, (cl_context_properties) m_vaDisplay, CL_CONTEXT_INTEROP_USER_SYNC, 1, 0};
m_clcontext = clCreateContext(props, 1, &m_cldevice, NULL, NULL, &error);

...

From my point of view the only difference between CentOS and our Embedded Linux is the Kernel version (3.10.0-229.1.2.39163.MSSr4.el7.centos.x86_64 vs. 3.14.5). We are also not using i915 as a separate kernel module (i915.ko), but instead the module is built into the kernel. But this should not be the reason I guess.

I'm aware that our setup is not an official Linux distribution supported by Intel, but perhaps someone has had the same issue I have right now? So I wanted to ask whether anyone has ever succeeded in using the generic Kernel 3.14.5 + libva/libdrm sources from the Media Server Package to create a working setup. Or is there any way to get some debug messages out of the OpenCL implementation that might help me find the reason why OpenCL isn't working?

Greetings from Germany,

Frédéric
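
While chasing a failure like this, it can help to log error codes by name. The sketch below is a plain C lookup with the numeric values hard-coded from the Khronos `CL/cl.h` header so that it compiles without OpenCL installed. `cl_error_name` is a hypothetical helper and covers only a subset of codes. It does not surface any diagnostics from inside the OpenCL implementation itself.

```c
/* Maps a few OpenCL error codes to their names from CL/cl.h.
   Only a subset is listed; anything else is reported as unknown. */
const char *cl_error_name(int code)
{
    switch (code) {
    case 0:   return "CL_SUCCESS";
    case -1:  return "CL_DEVICE_NOT_FOUND";
    case -2:  return "CL_DEVICE_NOT_AVAILABLE";
    case -4:  return "CL_MEM_OBJECT_ALLOCATION_FAILURE";
    case -5:  return "CL_OUT_OF_RESOURCES";
    case -6:  return "CL_OUT_OF_HOST_MEMORY";
    case -11: return "CL_BUILD_PROGRAM_FAILURE";
    case -17: return "CL_LINK_FAILURE";
    case -30: return "CL_INVALID_VALUE";
    case -32: return "CL_INVALID_PLATFORM";
    case -33: return "CL_INVALID_DEVICE";
    case -34: return "CL_INVALID_CONTEXT";
    default:  return "unknown OpenCL error";
    }
}
```

Passing the `error` value written by `clCreateContext` to this function would turn the -5 seen here into `CL_OUT_OF_RESOURCES` in a log line.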

 

Random memory read performance difference between GPU and CPU (I7-4770R)?

We are running simple code that does random reads and sequential writes (i.e. a gather operation) on both the CPU and the GPU part of the i7-4770R (separately, one at a time), and we see 4x slower performance on the GPU than on the CPU. With sequential reads and writes, and even with random writes, the performance is very similar, which indicates that both the internals of the chip and the memory controller allow the GPU to access DRAM at the same speed as the CPU. However, we have no idea why random reads suffer a 4x performance penalty, and this limits our application's performance quite a lot. It would be good to know the reason for this performance difference and whether there is some remedy for it.

Here are also the numbers from our experiments. The metric is execution time, so the lower the better.

 

Configuration                                        MAP     REDUCE   GATHER   SCATTER
Intel i-4770r IrisPro-16G mem-4 Cores-OpenMP-CPU     24.73   13.65    36.34    231.67
Intel i-4770r IrisPro-16G mem-40 EU-OpenCL-GPU       23.55   16.29    167.03   270.7
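
To make the indexed access patterns concrete, here is a minimal CPU reference for the two of them in plain C. It is a sketch of what "gather" and "scatter" are taken to mean in the description above (random reads with sequential writes, and sequential reads with random writes). The function names and the `slowdown` helper are hypothetical, and this is not the poster's benchmark code.

```c
#include <stddef.h>

/* Gather: sequential writes, reads at arbitrary indices. */
void gather(float *dst, const float *src, const size_t *idx, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[idx[i]];
}

/* Scatter: sequential reads, writes at arbitrary indices. */
void scatter(float *dst, const float *src, const size_t *idx, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[idx[i]] = src[i];
}

/* Ratio of two execution times; values above 1 mean the first took longer. */
double slowdown(double slower_time, double faster_time)
{
    return slower_time / faster_time;
}
```

With the GATHER figures above, `slowdown(167.03, 36.34)` is about 4.6, in line with the roughly 4x penalty described; for SCATTER, `slowdown(270.7, 231.67)` is about 1.17.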

Visual Studio Community 2013 crashes when building program from Code-builder

I am using Visual Studio Community 2013 Version 12 Update 4 and OpenCL Code Builder 1.4.0.25 on 64-bit laptop (Windows 8.1, Intel Core i5-5200U and GPU HD Graphics 5500).

I have created a Code Builder project for Windows in C++ and I can build (Configuration: Debug Win32) and run it without any problem.

I have created a session, and Visual Studio always crashes when I try to use following options

  1. build program and compile program from Code-builder menu
  2. build session and compile session from contextual menu

What should I do to make Code Builder work?

 

 

Memcpy performance using opencl kernel

Hi,

I have written a simple memcpy kernel, shown below, and I am analyzing its performance on the GPU using VTune.

__kernel void deinterlace_Y(__read_only image2d_t YIn, __write_only image2d_t YOut)
{
    /* Doing operation of Memcpy */
    int2 coord_src = (int2)(get_global_id(0), get_global_id(1));

    const sampler_t smp = CLK_FILTER_NEAREST | CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE;

    uint4 pixel4 = read_imageui(YIn, smp, coord_src);

    write_imageui(YOut, coord_src, pixel4);
}

I observe the below stats for Execution units:

EU Array
Active   Stalled   Idle
24.6%    18.1%     57.2%

Also, the number of computing threads started is 24,525,023, which is quite high. I don't know how to reduce the number of threads started and thereby improve performance.

I can't understand how to improve its performance. I have gone through this link on optimizations: https://int2-software.intel.com/en-us/articles/optimizing-simple-opencl-kernels
At this link, all the optimizations relate to buffers, where we can read 16 elements from memory in one go. In my case, since I am using texture memory reads (the image APIs), I don't know how to increase performance.


Bug in OpenCL Code Builder's Debugger

To start this post off, I would like to say that I do not take the term "bug" lightly - it gets thrown around on the internet often, but usually when someone does not understand the tool set. I hope this post is not one of those, but my research tells me otherwise. TL;DR clause: there appears to be a problem with get_global_size(dim) from the debugger.

Let's start by defining a simple kernel (emphasis on simple) that sets each array element in an in/out buffer to its associated linear index:

__kernel void test_kernel(__global int* buffer)
{
    int2 pos = {get_global_id(0), get_global_id(1)};
    int2 gs = {get_global_size(0), get_global_size(1)};
    int i = pos.x + pos.y*gs.x;

    buffer[i] = i;
}

In Intel OpenCL Code Builder (64-bit), version 1.4.0.25, if we run the kernel in the analyzer with a global size (x,y,z) of 1024,1024,0, this works as expected. The buffer is of size 1024*1024 and gets the correct associated index - so the runtime seems OK.

If you run the same thing in the debugger (see below), the issue arises - one would expect per the standard that get_global_size() will return 1024, right?

Not quite. If we start the debugger and go to global work item 1023x1023 (the 'lower right' if you want to think of it geometrically in 2D with dimension 0 as the x-axis and 1 as the y-axis), running up until the last instruction, we observe that the global sizes are flat wrong:

In the above, I set the local size to 16x16x0. Everything appears right in the debugger selections, I can select item 1023,1023 and I get local work group 63,63 item 15,15. But the global size is 16 (???), leading to incorrect buffer indexing.
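
The effect of the wrong value on indexing can be worked through in plain C. The sketch below mirrors the kernel's index arithmetic on the host; `linear_index` is a hypothetical helper, not part of Code Builder or the OpenCL API.

```c
#include <stddef.h>

/* Same arithmetic as the kernel: i = x + y * global_size_x. */
size_t linear_index(size_t x, size_t y, size_t global_size_x)
{
    return x + y * global_size_x;
}
```

For work-item (1023, 1023), `linear_index(1023, 1023, 1024)` is 1048575, the last element of the 1024*1024 buffer, whereas a reported global size of 16 gives `linear_index(1023, 1023, 16)` = 17391.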

Other notes: sometimes the application crashes when I run this in the 64-bit version, and I have never had any luck in 32-bit mode. Because this has been such a great learning tool for OpenCL up to this point, I would like to recommend it for use in the workplace… but realistically can't with such a blatant deficiency (nothing against anyone really, this kind of thing happens).

Can someone let me know if I am doing anything wrong, or perhaps take a look into the matter?

Error when I run occupancy for kernels

I am using Visual Studio Community 2013 Version 12 Update 4 and OpenCL Code Builder 1.4.0.25 on 64-bit laptop (Windows 8.1, Intel Core i5-5200U and GPU HD Graphics 5500).

When I try to run occupancy for kernels, I get a script error. On line 2863, character 98,  ":" was expected.

How can I fix the problem?

Thanks in advance

 

 

Which tools are needed to develop OpenCL code for Intel Iris Pro 6200 Graphics

I want to develop OpenCL code that will work on Intel Iris Pro 6200 or 6300. I want to know which tools support OpenCL for this device.

clSetEventCallback Oddity

I suspect there's an aspect of clSetEventCallback that I'm misunderstanding. In the simplified code below, you can see a kernel call followed by a clSetEventCallback. The program then sits and waits for a Windows event to be set (inside the callback).
If the clSetEventCallback is immediately followed by a clFlush(), then the callback is called and the wait is released as expected. However, without the clFlush, the kernel is never called, nor its callback, and the wait is eternal. What am I missing here?

Environment: i7-4770, Intel OpenCL SDK 4.6, Windows 7 SP1, Visual Studio 2013, driver 15.36.19

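#include "stdafx.h"   // must come first when precompiled headers are enabled

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <windows.h>

#include "CL\cl.h"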

int initialiseEnvironment(
    char *KernelSource,
    cl_device_id *device_id,
    cl_context *context,
    cl_command_queue *commandqueue,
    cl_program *program
    );

void __stdcall kernel_complete_callback(cl_event complete_event, cl_int cmd_sts, void *user_data)
{

    HANDLE* bufferEventHandle = (HANDLE*)user_data;

    if (!SetEvent(*bufferEventHandle))
        printf("callback fail");
}


int _tmain(int argc, _TCHAR* argv[])
{

    DWORD dwWaitResult;
    cl_int errcode_ret = CL_SUCCESS;

    char *KernelSource = "\n"
        "__kernel void debugTest() \n"
        "{  \n"
        "  printf(\"here\"); \n"
        "}  \n"
        "\n";

    cl_device_id device_id;
    cl_context context;
    cl_command_queue commandQueue;
    cl_program program;

    // set up the opencl basics

    initialiseEnvironment(
        KernelSource,
        &device_id,&context,&commandQueue,&program);

    cl_kernel debugTest = clCreateKernel(program, "debugTest", &errcode_ret);
    if (CL_SUCCESS != errcode_ret) return 0;

    HANDLE bufferCompleteEvent = CreateEvent(NULL, TRUE, FALSE, NULL);
    if (bufferCompleteEvent == NULL) return 0;

    if (!ResetEvent(bufferCompleteEvent))return 0;

    cl_event kernelCompleteEvent;

    size_t threadSize = 1;

    errcode_ret = clEnqueueNDRangeKernel(
        commandQueue,
        debugTest,
        1,
        0,
        &threadSize,
        NULL,
        0,
        NULL,&(kernelCompleteEvent));

    if (CL_SUCCESS != errcode_ret)
        return 0;

    errcode_ret = clSetEventCallback(kernelCompleteEvent, CL_COMPLETE, &kernel_complete_callback, &bufferCompleteEvent);

    if (CL_SUCCESS != errcode_ret)
        return 0;

    clFlush(commandQueue);       // if commented, then program never gets beyond wait on next line

    dwWaitResult = WaitForSingleObject(bufferCompleteEvent, INFINITE);

    if (dwWaitResult != WAIT_OBJECT_0)
        return 0;

    return 0;
}


int initialiseEnvironment(
    char *KernelSource,
    cl_device_id *device_id,
    cl_context *context,
    cl_command_queue *commandqueue,
    cl_program *program
    )
{

    int err;
    cl_uint platformCount;

    clGetPlatformIDs(0, NULL, &platformCount);
    cl_platform_id *platforms = (cl_platform_id*)malloc(sizeof(cl_platform_id) * platformCount);
    clGetPlatformIDs(platformCount, platforms, NULL);

    unsigned int selectedPlatform = -1;
    for (unsigned int i = 0; i < platformCount; i++) {

        char* value;
        size_t size = 0;
        clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME, size, NULL, &size);
        value = (char*)malloc(sizeof(char) * size);
        clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME, size, value, NULL);

        if (strcmp(value, "Intel(R) OpenCL") == 0) {
            selectedPlatform = i;
            break;

        }
    }

    unsigned int deviceCount;

    err = clGetDeviceIDs(platforms[selectedPlatform], CL_DEVICE_TYPE_GPU, 0, NULL, &deviceCount);

    if (CL_SUCCESS != err)
    {
        printf("Error: Failed to clGetDeviceIDs, returned\n");
        return 0;
    }

    cl_device_id *devices = (cl_device_id*)malloc(sizeof(cl_device_id) * deviceCount);

    // keep the return code so the check below tests this call, not the previous one
    err = clGetDeviceIDs(platforms[selectedPlatform], CL_DEVICE_TYPE_GPU, deviceCount, devices, NULL);

    if (CL_SUCCESS != err)
    {
        printf("Error: Failed to clGetDeviceIDs\n");
        return 0;
    }

    *device_id = devices[0];

    *context = clCreateContext(0, 1, device_id, NULL, NULL, &err);

    if (!*context)   // test the created handle, not the out-parameter pointer
    {
        printf("Error: Failed to create a compute context!\n");
        return 0;
    }

    *commandqueue = clCreateCommandQueue(*context, *device_id, 0, &err);

    if (!*commandqueue)   // test the created handle, not the out-parameter pointer
    {
        printf("Error: Failed to create a command commands!\n");
        return 0;
    }

    *program = clCreateProgramWithSource(*context, 1, (const char **)& KernelSource, NULL, &err);

    if (!*program)   // test the created handle, not the out-parameter pointer
    {
        printf("Error: Failed to create compute program!\n");
        return 0;
    }

    err = clBuildProgram(*program, 0, NULL, NULL, NULL, NULL);

    if (err != CL_SUCCESS)
    {
        // Determine the reason for the error
        char buildLog[16384];
        clGetProgramBuildInfo(*program, *device_id, CL_PROGRAM_BUILD_LOG,
            sizeof(buildLog), buildLog, NULL);
        printf("Error in program: %s", buildLog);
        clReleaseProgram(*program);
        return 0;
    }
    return 1;
}

 

 

Unable to post because of spam filter

My posts are all spam filtered - unable to post.
 
