Channel: Intel® Software - OpenCL*

Events not getting their ref count incremented


When I pass an array of 2 wait events into clEnqueueReadBuffer, with one event created manually, the ref counts are not incremented.

As I understand it, all events passed into CL API calls should have their reference counts incremented.

This works correctly with my GPU, but not on the CPU (Windows 7 64-bit, Core i7-3770, latest Intel OpenCL driver for CPU).
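For reference, this is roughly how I check it (a minimal sketch; the buffer, queue and event names are placeholders, and I am assuming the manually created event is a user event from clCreateUserEvent):

    cl_uint refs = 0;
    clGetEventInfo(user_event, CL_EVENT_REFERENCE_COUNT, sizeof(refs), &refs, NULL);
    printf("ref count before enqueue: %u\n", refs);

    cl_event wait_list[2] = { user_event, prior_event };
    clEnqueueReadBuffer(queue, buffer, CL_FALSE, 0, size, host_ptr,
                        2, wait_list, NULL);

    clGetEventInfo(user_event, CL_EVENT_REFERENCE_COUNT, sizeof(refs), &refs, NULL);
    printf("ref count after enqueue:  %u\n", refs);   /* incremented on the GPU, unchanged on the CPU */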


OpenCL GPU access on Windows over RDP/Remote Desktop


I would like to be able to detect and access the Intel GPUs using OpenCL even over RDP/Remote Desktop sessions so that I can deploy applications that can utilize the full system capabilities even without direct console access. Right now, it appears that I would have to switch to use an alternate remote access client like VNC to keep the Intel video drivers loaded and keep the GPU accessible.

Having the GPU available over RDP would help with systems that are being remotely managed and crunching large amounts of data that can benefit from GPUs. Being able to use RDP allows for a login system that fits well with corporate login systems and doesn't need an additional remote client access installation.
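For context, this is roughly the detection I have in mind (a minimal sketch): enumerate the platforms and ask each one for GPU devices; in my RDP sessions the Intel GPU device does not show up at all.

    #include <stdio.h>
    #include <CL/cl.h>

    int main(void)
    {
        cl_platform_id platforms[8];
        cl_uint num_platforms = 0;
        clGetPlatformIDs(8, platforms, &num_platforms);

        for (cl_uint i = 0; i < num_platforms; ++i) {
            char name[256];
            cl_uint num_gpus = 0;
            clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME, sizeof(name), name, NULL);
            cl_int err = clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_GPU, 0, NULL, &num_gpus);
            printf("%s: %u GPU device(s)%s\n", name,
                   err == CL_SUCCESS ? num_gpus : 0,
                   err == CL_DEVICE_NOT_FOUND ? " (not visible in this session)" : "");
        }
        return 0;
    }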

Debugging kernels with Intel INDE


Hello. I am looking for a sample showing how to debug kernels with Intel INDE, but I can't find any binaries or tutorials. Is it possible to use INDE to debug OpenCL kernels? Can you point me to a sample, a tutorial, or, better yet, a binary? Thanks.

I use the INDE OpenCL add-in for Visual Studio 2013.

P.S. Sorry for my English.

Device-side Enqueue to data structure in accordance with a condition


Hi,

I would like to port the following algorithm to OpenCL, and it seems device-side enqueue could help me improve performance.

I want to count how many times the following happens. At step i, I pick i numbers, sum them, and check whether the sum is bigger than a threshold. If it is, I move on to step i+1, and so on until the final step. For instance, at step 4 I pick 4 numbers, sum them, and check whether the sum is bigger than the threshold; if it is, I repeat the process with 5 numbers. If I reach the final step F, I count one. I run this small piece of code N times. The real algorithm classifies the pixels of an image into two categories (pixels that pass all steps and pixels that do not), where N is the width*height of the image.

Characteristics of the algorithm:

  • every new step is more computationally expensive (I pick up more numbers), but is executed fewer times
  • the computation time of each step can easily be estimated
  • it is unknown in advance at which step the sum will fail to exceed the threshold

I have tried the following naïve algorithm in OpenCL:

As I need to run the algorithm N times, I launch N kernel instances, and each thread is in charge of the whole process. The main problem is the workload imbalance between threads: two threads of a warp can end at different stages (one thread at stage 2 but another at stage 10), so most of the threads sit idle waiting for the others.

So far, I have different ideas:

  • Launch from the host N kernel instances that take care of j steps (where j is much smaller than the total number of steps), compact the threads that should still be working, and launch them again from the host. Keep launching and compacting until I reach the final step.
  • Launch from the host N kernel instances that take care of j steps and continue later on the CPU. The CPU would be in charge of the last, computationally expensive steps.
  • Launch from the host N kernel instances, and each of them launches mini-kernels on the device, each in charge of only one step (a sketch of this idea follows the list).
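To make the last idea concrete, here is a minimal sketch of what I have in mind, assuming OpenCL 2.0 device-side enqueue. The kernel names, the THRESHOLD constant and the memory-access pattern are placeholders rather than my real code, and the host would additionally have to create a default on-device queue (CL_QUEUE_ON_DEVICE | CL_QUEUE_ON_DEVICE_DEFAULT):

    #define THRESHOLD 100.0f   /* placeholder */

    /* Sum 'step' numbers of one instance (placeholder access pattern). */
    float step_sum(global const float *numbers, int instance, int step, int stride)
    {
        float s = 0.0f;
        for (int k = 0; k < step; ++k)
            s += numbers[instance * stride + k];
        return s;
    }

    kernel void classify(global const float *numbers,
                         volatile global int *pass_count,
                         int j, int final_step)
    {
        int instance = (int)get_global_id(0);

        /* First j cheap steps, done by this work-item itself. */
        for (int step = 1; step <= j; ++step)
            if (step_sum(numbers, instance, step, final_step) <= THRESHOLD)
                return;                            /* rejected early, nothing more to do */

        /* Survivor: enqueue the remaining, expensive steps as a tiny child grid instead
           of keeping the whole work-group alive while this single item finishes. */
        queue_t q = get_default_queue();
        enqueue_kernel(q, CLK_ENQUEUE_FLAGS_NO_WAIT, ndrange_1D(1),
            ^{
                for (int step = j + 1; step <= final_step; ++step)
                    if (step_sum(numbers, instance, step, final_step) <= THRESHOLD)
                        return;
                atomic_inc(pass_count);            /* this instance passed every step */
            });
    }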

If I use nested parallelism, will I increase the occupancy of my threads? Which solution should have better performance? Will a kernel that launches only one mini-kernel need to wait for a kernel that launches many of them?

Thanks in advance

clBuildProgram crashes for Xeon Phi


Hi, 

I have installed a Xeon Phi on CentOS 6.6. When I run my OpenCL application with the Intel SDK it works on the CPU, but it crashes when I select the Xeon Phi, which shows up as "Intel(R) Many Integrated Core Acceleration Card". libOpenCL points to "/opt/intel/opencl/lib64/libOpenCL.so" on my system.

Please advise. 
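So far the only diagnostic I have tried is dumping the build log: if clBuildProgram returns an error rather than taking the whole process down, the log usually says why. A minimal sketch (needs <stdio.h>, <stdlib.h> and <CL/cl.h>; 'program' and 'device' already exist in my code):

    size_t log_size = 0;
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);
    char *log = (char *)malloc(log_size + 1);
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, log_size, log, NULL);
    log[log_size] = '\0';
    printf("build log for MIC device:\n%s\n", log);   /* empty if the build never ran */
    free(log);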

 

Intel Gen8 architecture: calculating total kernel instances per execution unit


I am taking my reference from the Intel Gen8 architecture document (intel_gen8_arch).

A few sections are causing confusion in my understanding of the SIMD engine concept.

5.3.2 SIMD FPUs Within each EU, the primary computation units are a pair of SIMD floating-point units (FPUs). Although called FPUs, they support both floating-point and integer computation. These units can SIMD execute up to four 32-bit floating-point (or integer) operations, or SIMD execute up to eight 16-bit integer or 16-bit floating-point operations. Each SIMD FPU can complete simultaneous add and multiply (MAD) floating-point instructions every cycle. Thus each EU is capable of 16 32-bit floating-point operations per cycle: (add + mul) x 2 FPUs x SIMD-4.

The above lines of the document clearly state the maximum number of floating-point operations that can be done on each execution unit.

First question: I think this is referring to each hardware thread of an execution unit rather than to the whole execution unit. Am I right here?

In section 5.3.5 it mentions: "On Gen8 compute architecture, most SPMD programming models employ this style code generation and EU processor execution. Effectively, each SPMD kernel instance appears to execute serially and independently within its own SIMD lane. In actuality, each thread executes a SIMD-Width number of kernel instances concurrently. Thus for a SIMD-16 compile of a compute kernel, it is possible for SIMD-16 x 7 threads = 112 kernel instances to be executing concurrently on a single EU. Similarly, for a SIMD-32 compile of a compute kernel, 32 x 7 threads = 224 kernel instances could be executing concurrently on a single EU."

This illustration seems to contradict section 5.3.2.

Specifically: 1) Since it says each HW thread of an EU has two SIMD-4 units, how does SIMD-32 work? How do we reach the figure of 224 kernel instances on 7 threads? Do we combine some hardware threads?

Also, how do we compile a kernel in SIMD-16 or SIMD-32 mode?
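The only mechanism I have come across so far (I am not sure it applies to the driver discussed here) is the cl_intel_required_subgroup_size extension, which lets a kernel request a particular SIMD width; otherwise the compiler chooses the width heuristically. A minimal sketch, assuming the extension is reported by the device:

    // Only valid if the device reports the cl_intel_required_subgroup_size extension.
    __attribute__((intel_reqd_sub_group_size(16)))   // request a SIMD-16 compile
    kernel void my_kernel(global float *data)
    {
        size_t gid = get_global_id(0);
        data[gid] *= 2.0f;
    }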

clCreateKernel fails with CPU selected


Hi, I have a simple kernel.

When I use the Intel offline compiler ioc32.exe, clCreateKernel fails with error code CL_INVALID_VALUE.

When I recompile the kernel for the Intel HD 4600 GPU, clCreateKernel succeeds.

How can I obtain more debug information about why it fails on the CPU?
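The only query I know of that might help (a sketch, assuming 'program' was built successfully for the CPU device) is asking the runtime which kernels it actually sees in the program:

    /* OpenCL 1.2: semicolon-separated list of kernel names in the program */
    char names[1024] = "";
    clGetProgramInfo(program, CL_PROGRAM_KERNEL_NAMES, sizeof(names), names, NULL);
    printf("kernels in program: %s\n", names);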

 

 

Best utilizing the Intel Iris 5200 architecture using OpenCL


Hi Robert Loffe,

Need your help here!

I am using the Intel Iris 5200 GPU and an Intel i7-4770R processor, with Windows 8.1 as the OS.

I am optimizing the code snippet below on the Intel Iris:

The global size is 1920x1080; the local size is NULL (I have left this to the compiler).

__kernel void experiment(__read_only image2d_t YIn, __write_only image2d_t YOut)
{
    uint4 best_suited = 0;
    int best_sum, sum;

    int2 coord_src = (int2)(get_global_id(0), 2*get_global_id(1)+1);
    const sampler_t smp = CLK_FILTER_NEAREST | CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE;

    /* Read luma pixels of the current line */
    uint4 pixel1 = read_imageui(YIn, smp, coord_src + (int2)(-3,0));
    uint4 pixel2 = read_imageui(YIn, smp, coord_src + (int2)(-2,0));
    uint4 pixel3 = read_imageui(YIn, smp, coord_src + (int2)(-1,0));
    uint4 pixel4 = read_imageui(YIn, smp, coord_src + (int2)( 0,0));
    uint4 pixel5 = read_imageui(YIn, smp, coord_src + (int2)( 1,0));
    uint4 pixel6 = read_imageui(YIn, smp, coord_src + (int2)( 2,0));
    uint4 pixel7 = read_imageui(YIn, smp, coord_src + (int2)( 3,0));

    /* Read luma pixels of the next line */
    uint4 pixel_nxt1 = read_imageui(YIn, smp, coord_src + (int2)(-3,2));
    uint4 pixel_nxt2 = read_imageui(YIn, smp, coord_src + (int2)(-2,2));
    uint4 pixel_nxt3 = read_imageui(YIn, smp, coord_src + (int2)(-1,2));
    uint4 pixel_nxt4 = read_imageui(YIn, smp, coord_src + (int2)( 0,2));
    uint4 pixel_nxt5 = read_imageui(YIn, smp, coord_src + (int2)( 1,2));
    uint4 pixel_nxt6 = read_imageui(YIn, smp, coord_src + (int2)( 2,2));
    uint4 pixel_nxt7 = read_imageui(YIn, smp, coord_src + (int2)( 3,2));

    /* main block: pick the candidate pair with the smallest absolute-difference score */
    {
        best_sum = abs_diff(pixel3.x,pixel_nxt4.x) + abs_diff(pixel4.x,pixel_nxt5.x) + abs_diff(pixel5.x,pixel_nxt6.x) - 8;
        best_suited.x = (pixel4.x + pixel_nxt2.x) >> 1;

        sum = abs_diff(pixel2.x,pixel_nxt2.x) + abs_diff(pixel3.x,pixel_nxt6.x) + abs_diff(pixel4.x,pixel_nxt1.x);

        if (sum < best_sum)
        {
            best_sum = sum;
            best_suited.x = (pixel3.x + pixel_nxt3.x) >> 1;

            sum = abs_diff(pixel1.x,pixel_nxt5.x) + abs_diff(pixel2.x,pixel_nxt6.x) + abs_diff(pixel3.x,pixel_nxt7.x) + 16;

            if (sum < best_sum)
            {
                best_sum = sum;
                best_suited.x = (pixel5.x + pixel_nxt1.x) >> 1;
            }
        }

        sum = abs_diff(pixel4.x,pixel_nxt5.x) + abs_diff(pixel5.x,pixel_nxt2.x) + abs_diff(pixel6.x,pixel_nxt1.x);

        if (sum < best_sum)
        {
            best_sum = sum;
            best_suited.x = (pixel4.x + pixel_nxt3.x) >> 1;

            sum = abs_diff(pixel5.x,pixel_nxt3.x) + abs_diff(pixel6.x,pixel_nxt4.x) + abs_diff(pixel7.x,pixel_nxt3.x);

            if (sum < best_sum)
            {
                best_sum = sum;
                best_suited.x = (pixel6.x + pixel_nxt2.x) >> 1;
            }
        }
    }

    /* pixel4 (offset 0,0) is the current pixel: pass it through unchanged */
    write_imageui(YOut, coord_src, pixel4);
    /* store the interpolated result on the line below */
    write_imageui(YOut, coord_src + (int2)(0,1), best_suited);
}

I have tried the following things: 1) abs_diff is the built-in function; replacing abs_diff with equivalent bitwise code does not give any improvement.

2) I analysed its performance using Intel VTune and saw that the execution units are idle 30% of the time. GPU memory read bandwidth is 7.6 GB/s and write bandwidth is 3.942 GB/s. The number of L3 cache misses is close to 177x10^9, the number of computing threads is close to 3.5 million, and the sampler is reported as a bottleneck 8.3% of the time.

Thinking further: 1) I don't know whether reading the data into local memory will benefit me or not, since local memory access has the same cost as an L3 cache access on Intel architecture, and by reading via the image APIs I am already going through the cache for image objects, i.e. the texture memory. The only help I can imagine is reducing the sampler bottleneck if I write something like this (see the sketch after point 2): __local uint4 smem[256]; smem[get_local_id(0)] = read_imageui(YIn, smp, coord_src);

2) I also don't know what the optimal work-group size should be here.
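Here is a sketch of the local-memory idea from point 1, staging one row of luma pixels in SLM so that neighbouring work-items share the sampler reads instead of each issuing seven of them. It assumes a fixed 16x1 work-group (so the local size could no longer be left NULL), and the names are mine; whether it actually wins anything on Iris is exactly what I would like to know:

    __local uint4 row[22];                 /* 16 pixels + a 3-pixel apron on each side */
    int lx = (int)get_local_id(0);

    row[lx + 3] = read_imageui(YIn, smp, coord_src);
    if (lx < 3) {
        row[lx]      = read_imageui(YIn, smp, coord_src + (int2)(-3, 0)); /* left apron  */
        row[lx + 19] = read_imageui(YIn, smp, coord_src + (int2)(16, 0)); /* right apron */
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    uint4 pixel1 = row[lx];        /* was read_imageui(..., coord_src + (int2)(-3,0)) */
    uint4 pixel4 = row[lx + 3];    /* current pixel                                   */
    uint4 pixel7 = row[lx + 6];    /* was read_imageui(..., coord_src + (int2)( 3,0)) */
    /* pixel_nxt1..7 would be staged the same way from a second __local row */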

Can anyone explain to me in detail how this code can be optimized? How can I reduce the execution-unit idle time, the number of computing threads and the L3 cache misses, and increase my GPU memory read and write bandwidth? If you can rewrite the code, that would be really helpful.


MSBuild with Visual Studio and kernels


Hi, I have a kernel that has several includes in it.

Is there a way for the Intel OpenCL SDK to track these dependencies automatically? I have not discovered any, so I must add them manually in MSBuild. Here is an example.

NVIDIA CUDA, for example, ships a GenDepTask that obtains the dependencies from the compiler.

ioc.exe does not have anything similar, like the cl /showIncludes or gcc -E options.

  <ItemGroup>

    <None Include="..\src\opencl\opencl_stdint.h" />
    <Intel_OpenCL_Build_Rules Include="..\src\opencl\opencl_grayscale.cl">
      <FileType>Document</FileType>
      <AdditionalDependencies Condition="'$(Configuration)|$(Platform)'=='release|Win32'">$(ProjectDir)..\src\opencl\opencl_stdint.h;$(ProjectDir)..\src\opencl\opencl_imaging.h;$(ProjectDir)..\src\opencl\opencl_image_kernel_info.h</AdditionalDependencies>
      <AdditionalDependencies Condition="'$(Configuration)|$(Platform)'=='debug|Win32'">$(ProjectDir)..\src\opencl\opencl_stdint.h;$(ProjectDir)..\src\opencl\opencl_imaging.h;$(ProjectDir)..\src\opencl\opencl_image_kernel_info.h</AdditionalDependencies>
      <Include Condition="'$(Configuration)|$(Platform)'=='release|Win32'">$(ProjectDir)..\src\opencl</Include>
      <Include Condition="'$(Configuration)|$(Platform)'=='debug|Win32'">$(ProjectDir)..\src\opencl</Include>
      <Device Condition="'$(Configuration)|$(Platform)'=='release|Win32'">1</Device>
      <Device Condition="'$(Configuration)|$(Platform)'=='debug|Win32'">1</Device>
      <SPIR32 Condition="'$(Configuration)|$(Platform)'=='debug|Win32'">1</SPIR32>
      <SPIR32 Condition="'$(Configuration)|$(Platform)'=='release|Win32'">1</SPIR32>
    </Intel_OpenCL_Build_Rules>
  </ItemGroup>

 

GPU HD4600 OpenCL kernel problem


Hi, I am compiling an offline SPIR kernel.

When I use it on the HD4600 GPU, I get the following when I invoke clBuildProgram:

error: IGILTargetLowering::Call(): unhandled function call!

Call made to: _Z13get_global_idj()
0x7c53480: i64 = GlobalAddress<i64 (i32)* @_Z13get_global_idj> 0 [ORD=1]
error: midlevel compiler failed build.

The same kernel works fine on an AMD GPU and on the Intel CPU. It also works fine if the kernel is compiled as spir64.

kernel void kernel_main(const global read_only uint8_t* rgb_t, global write_only uint8_t* grayscale, const image_kernel_info src, const image_kernel_info dst)
{
    uint32_t x = get_global_id(0);
    uint32_t y = get_global_id(1);

    if (is_in_interior(&src, x, y))
    {
        const global rgb* rgb_ = (const global rgb*) sample_2d_clamp(rgb_t, &src, x, y, sizeof( rgb ) );
        float  r = rgb_->r / 255.0f;
        float  g = rgb_->g / 255.0f;
        float  b = rgb_->b / 255.0f;

        float  gray = 0.2989f * (r  * r) + 0.5870f * (g *  g) + 0.1140f * (b  *  b);
        uint8_t  gray_quantized = (uint8_t ) (sqrt(gray) * 255.0f);

        write_2d_uint8( grayscale, &dst, x, y, gray_quantized);
    }
}

 

 

 

 

 

HD4400 clEnqueueCopyBufferRect issue?


Hello,

We've detected suspicious behaviour of the clEnqueueWriteBufferRect/clEnqueueCopyBufferRect functions, which is demonstrated with the simple test case attached. The test case depends on the OpenCL API only. It works correctly on AMD Tahiti but not on Intel HD4400/HD4600.

 

The problem is in copying a rectangle of interest with some specific parameters from the whole image, which is kept in a cl buffer.

 

A short description of the test case:

1. create an OpenCL buffer for the whole image (not initialized)

2. copy data from an aligned host-memory rectangle with some specific parameters into this buffer (using clEnqueueWriteBufferRect)

3. create an OpenCL buffer of the size of the rectangle used in the previous step (not initialized)

4. copy the GPU data for the rectangle from the whole-image cl buffer to this rectangle-only cl buffer (using clEnqueueCopyBufferRect)

5. map the rectangle-only cl buffer to host memory

6. compare the content of the original host data for the rectangle with the content of this mapped memory. It is expected to be exactly the same, but it is not in our environment (HD4400).
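For readers without the attachment, the sequence of steps 2-5 boils down to something like the sketch below. The sizes, origins and pitches here are made up for illustration (the real reproducer uses different, specific parameters), and 'queue', 'whole_image', 'rect_only' and 'host_rect' are assumed to already exist:

    size_t row_pitch = 1024;                 /* whole-image row pitch inside the cl buffer */
    size_t rect_w = 64, rect_h = 32;         /* rectangle of interest                      */
    size_t buf_origin[3]  = { 128, 16, 0 };  /* where the rectangle sits in the image      */
    size_t host_origin[3] = { 0, 0, 0 };
    size_t zero_origin[3] = { 0, 0, 0 };
    size_t region[3]      = { rect_w, rect_h, 1 };   /* region[0] is in bytes */

    /* step 2: host rectangle -> whole-image buffer */
    clEnqueueWriteBufferRect(queue, whole_image, CL_TRUE,
                             buf_origin, host_origin, region,
                             row_pitch, 0,        /* buffer row/slice pitch */
                             rect_w, 0,           /* host   row/slice pitch */
                             host_rect, 0, NULL, NULL);

    /* step 4: whole-image buffer -> rectangle-only buffer */
    clEnqueueCopyBufferRect(queue, whole_image, rect_only,
                            buf_origin, zero_origin, region,
                            row_pitch, 0,         /* src row/slice pitch */
                            rect_w, 0,            /* dst row/slice pitch */
                            0, NULL, NULL);

    /* step 5: map and compare against host_rect */
    cl_int err;
    unsigned char *mapped = (unsigned char *)clEnqueueMapBuffer(
        queue, rect_only, CL_TRUE, CL_MAP_READ, 0, rect_w * rect_h,
        0, NULL, NULL, &err);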

 

Below is sample output

 

on AMD system:

 

C:\project\vdudnik\iocl-bug\build\Debug>iocl-bug.exe

AMD Accelerated Parallel Processing

OpenCL 1.2 AMD-APP (1445.5)

Advanced Micro Devices, Inc.

Tahiti

no mismatch found, test PASSED

 

on Intel system:

C:\dev\iocl-bug\build\Debug>iocl-bug.exe

Intel(R) OpenCL

OpenCL 1.2

Intel(R) Corporation

Intel(R) HD Graphics 4400

detetcted mismatch: [ 0 X 1 ], expected = 4, actual = 55

detetcted mismatch: [ 1 X 1 ], expected = 4, actual = 55

detetcted mismatch: [ 2 X 1 ], expected = 4, actual = 55

detetcted mismatch: [ 3 X 1 ], expected = 4, actual = 55

detetcted mismatch: [ 4 X 1 ], expected = 4, actual = 55

detetcted mismatch: [ 0 X 2 ], expected = 4, actual = 55

detetcted mismatch: [ 1 X 2 ], expected = 4, actual = 55

detetcted mismatch: [ 2 X 2 ], expected = 4, actual = 55

detetcted mismatch: [ 3 X 2 ], expected = 4, actual = 55

detetcted mismatch: [ 4 X 2 ], expected = 4, actual = 55

detetcted mismatch: [ 5 X 2 ], expected = 4, actual = 55

detetcted mismatch: [ 6 X 2 ], expected = 4, actual = 0

detetcted mismatch: [ 7 X 2 ], expected = 4, actual = 0

show only few first mismatches.., test FAILED

 

To build this test case CMake 3.1+ is required.

 

Regards,
  Vladimir

Attachments: CMakeLists.txt (351 bytes), iocl-bug.cpp (26.82 KB)

setting work_group_size crashes OpenCL on Intel CPU


Hi

I am porting the reduction kernel from the AMD APP SDK.

It requires setting the work-group size when you execute clEnqueueNDRangeKernel. With a local_work_size different from 8, it crashes directly inside TBB in the Intel OpenCL runtime for the Intel CPU. The clEnqueueNDRangeKernel call itself returns successfully and launches the kernel.

When I query the maximum work-group size from the device it returns 8192 (it should be 8 in this case), and the kernel work-group size is 2048. It crashes with both of these values.

It works only when the local size equals the number of cores.

I have Intel Haswell 4770K.

I have global_size = 4096;

The Intel HD 4600 GPU works fine with all the different sizes allowed by the spec.

The project is located here:

https://github.com/kingofthebongo2008/dare12_opencl

the file that launches the kernel is located here:

https://github.com/kingofthebongo2008/dare12_opencl/blob/master/src/free...
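For reference, the sizes above come from these queries (a sketch; 'device', 'kernel' and 'queue' already exist in the project, and 128 below is just an example of a local size other than 8):

    size_t device_max = 0, kernel_max = 0;
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(device_max), &device_max, NULL);          /* 8192 on my CPU       */
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(kernel_max), &kernel_max, NULL); /* 2048 for this kernel */

    size_t global_size = 4096;
    size_t local_size  = 128;
    cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                        &global_size, &local_size, 0, NULL, NULL);
    /* err == CL_SUCCESS; the crash happens later, inside TBB, when the kernel runs */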

 

 

 

 

 

 

Come see me at booth 502 at IDF15


I will be attending IDF15 for the next three days. If you have OpenCL or OpenCV questions and want to talk to me in person, come see me at booth 502, where we are going to showcase a number of cool Intel(R) INDE technologies.

HD4400 bitwise and operation on uchar2 data


Hello,
We are seeing different results when implementing the "bitwise and" operation in an OpenCL kernel working on uchar2 data. OpenCL kernel code like this:
uchar2 val1;
uchar2 val2;
uchar2 res;
res = val1 & val2;

produces wrong results, while code like the one below:

res = (uchar2)(val1.x & val2.x, val1.y & val2.y);
produces the correct result.

BTW, the same behaviour is observed for bitwise or/xor and for uchar3/uchar4 data, although the attached test case was prepared only for "bitwise and" on uchar2 data.
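For completeness, the failing case boils down to a kernel of this shape (a minimal sketch, not the attached reproducer itself):

    kernel void and_test(global const uchar2 *a,
                         global const uchar2 *b,
                         global uchar2 *out)
    {
        size_t i = get_global_id(0);

        /* vector form: produces wrong results on HD4400/HD4600 in our tests */
        out[i] = a[i] & b[i];

        /* component-wise workaround, which produces correct results:
           out[i] = (uchar2)(a[i].x & b[i].x, a[i].y & b[i].y);          */
    }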

Testing environment:
Lenovo X240 notebook,
Windows 7 Service Pack 1, x64,
Microsoft Visual Studio 2013,
Intel INDE Package ID: w_inde_2015.2.027
Video driver version is 10.18.14.4139 WHQL Win7 64

Regards,
  Vladimir
 

Attachment: iocl-bug-uchar2.zip (5.52 KB)

Change image format of an image object in OpenCL


I have written a small piece of code to create an image object in OpenCL as below:

img_fmt.image_channel_order = CL_R;
img_fmt.image_channel_data_type = CL_UNSIGNED_INT8;
memobj_in_luma = clCreateImage2D(p->context, CL_MEM_READ_ONLY, &img_fmt, p->width, p->height, 0, NULL, &ret);

After creating this object I want to change the image format to CL_RGBA. Is there any way to do this?
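The only alternative I can think of is creating a second image with the other format alongside it, along these lines (a sketch; I don't know whether this is the intended approach):

    cl_image_format rgba_fmt;
    rgba_fmt.image_channel_order     = CL_RGBA;
    rgba_fmt.image_channel_data_type = CL_UNSIGNED_INT8;
    /* same width/height as the CL_R image above */
    cl_mem memobj_in_rgba = clCreateImage2D(p->context, CL_MEM_READ_ONLY, &rgba_fmt,
                                            p->width, p->height, 0, NULL, &ret);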


Compiling OpenCL 2.0 atomics


Hi,

I am trying to compile a simple kernel using OpenCL 2.0 atomics using exactly the device, driver, and kernel described in:

https://software.intel.com/en-us/forums/topic/556904

However, I cannot even get the kernel to compile, as it does not seem to recognize the atomic types and functions. My error log (along with some environment info) is:

  -- device info --
DEVICE_NAME:                Intel(R) HD Graphics 5500
DEVICE_VENDOR:              Intel(R) Corporation
DEVICE_VERSION:             OpenCL 2.0 
DRIVER_VERSION:             10.18.14.4029

:6:62: error: unknown type name 'atomic_int'
kernel void atomics_test(global int *output, volatile global atomic_int*  atomicBuffer, uint iterations, uint offset)
                                                             ^
:10:9: error: implicit declaration of function 'atomic_fetch_add_explicit' is invalid in OpenCL
        atomic_fetch_add_explicit(&atomicBuffer[0], MY_ADD_VALUE, memory_order_relaxed, memory_scope_device);
        ^

error: front end compiler failed build.

Am I missing an include or an extension to enable? I tried searching through the docs but couldn't find anything. For example, the Atomic Function page doesn't seem to describe any includes or extensions to use 'atomic_int':

https://www.khronos.org/registry/cl/sdk/2.0/docs/man/xhtml/atomicFunctio...
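One more thing I could not confirm from the docs: whether the program also has to be built with the OpenCL 2.0 language option, since my understanding is that the compiler defaults to OpenCL C 1.2 otherwise, e.g.:

    /* request the OpenCL C 2.0 language explicitly at build time (is this required?) */
    err = clBuildProgram(program, 1, &device, "-cl-std=CL2.0", NULL, NULL);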

Thanks a lot in advance!

Kernel “vector + vector” returns the right result only if the vector's length is a multiple of 64


I'm new to OpenCL. I'm trying to run a “vector + vector” kernel, and I only get the right result when the vector's length is a multiple of 64. For example, I get the output below when I set the length to 16.

No protocol specified
platform 1: vendor 'Intel(R) Corporation'
 device 0: 'Intel(R) HD Graphics'
0 + 16 = 0
1 + 15 = 0
2 + 14 = 0
3 + 13 = 0
4 + 12 = 0
5 + 11 = 0
6 + 10 = 0
7 + 9 = 0
8 + 8 = 0
9 + 7 = 0
10 + 6 = 0
11 + 5 = 0
12 + 4 = 0
13 + 3 = 0
14 + 2 = 0
15 + 1 = 0

 

You can find the code on this website: http://www.eriksmistad.no/getting-started-with-opencl-and-gpu-computing/

Environment:

  • CentOS 7.1
  • i7 4790
  • OpenCL 1.2
  • SDK: Intel SDK 2015 Production 16.4.2.1 from the Intel Media Server Studio Community edition
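In case it is relevant, the pattern I have seen recommended when the length is not a multiple of the work-group size is to round the global size up and guard the kernel; I do not know yet whether this is the actual cause here, so the names below are just a sketch:

    /* kernel side: guard against the padded extra work-items */
    kernel void vector_add(global const int *a, global const int *b,
                           global int *c, const int n)
    {
        int i = get_global_id(0);
        if (i < n)
            c[i] = a[i] + b[i];
    }

    /* host side: round the global size up to a multiple of the local size */
    size_t local_size  = 64;
    size_t global_size = ((n + local_size - 1) / local_size) * local_size;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size, 0, NULL, NULL);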

OpenCL on Intel HD 2500 on Ubuntu


Hi, 

I have a machine with Intel(R) Core(TM) i5-3550 CPU @ 3.30GHz and Intel HD 2500 running Ubuntu 14 and I would like to run some OpenCL code on the Graphics processor. 

Did anyone succeed in installing OpenCL for this configuration?

Thanks, 

Multiple Map/Unmap buffer


According to OpenCL 1.2 spec:

​clEnqueueMapBuffer and clEnqueueMapImage increment the mapped count of the memory object. The initial mapped count value of a memory object is zero. Multiple calls to clEnqueueMapBuffer or clEnqueueMapImage on the same memory object will increment this mapped count by appropriate number of calls. clEnqueueUnmapMemObject decrements the mapped count of the memory object.​

But it happens that the second mapping attempt returns an error code (-59, i.e. CL_INVALID_OPERATION).
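In plain C, the sequence that triggers this amounts to the following sketch (the flags and sizes in my real code may differ; 'queue', 'buffer' and 'size' are assumed to exist):

    cl_int err;
    void *p1 = clEnqueueMapBuffer(queue, buffer, CL_TRUE, CL_MAP_READ,
                                  0, size, 0, NULL, NULL, &err);   /* CL_SUCCESS     */
    void *p2 = clEnqueueMapBuffer(queue, buffer, CL_TRUE, CL_MAP_READ,
                                  0, size, 0, NULL, NULL, &err);   /* fails with -59 */

    clEnqueueUnmapMemObject(queue, buffer, p1, 0, NULL, NULL);
    clEnqueueUnmapMemObject(queue, buffer, p2, 0, NULL, NULL);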

I've prepared a simple reproducer for the problem (attached); its output is:

Using device: Intel(R) HD Graphics 4600 (iGPU), ver OpenCL 1.2

Using platform: Intel(R) OpenCL, ver OpenCL 1.2

Creating cl::Buffer(CL_MEM_USE_HOST_PTR, 00000083B8E33430)

Performing multiple mappings (should use internal OpenCL counter as per Khronos)

Mapping buffer #1... returned 00000083B8E33430

Mapping buffer #2...

*****

OpenCL runtime error: clEnqueueMapBuffer(-59)

 

while the expected output is:

Creating cl::Buffer(CL_MEM_USE_HOST_PTR, 0000004AD96DE160)

Performing multiple mappings (should use internal OpenCL counter as per Khronos)

Mapping buffer #1... returned 0000004AD96DE160

Mapping buffer #2... returned 0000004AD96DE160

Mapping buffer #3... returned 0000004AD96DE160

Unmapping buffer #1

Unmapping buffer #2

Unmapping buffer #3

All done.

 

Attachment: cl_map_unmap.zip (2.17 KB)

Argument list for built-in kernel convolve_2d_intel


There is a built-in kernel called "convolve_2d_intel" on the Intel Iris Pro Graphics 6200 GPU in the i7-5775C processor.

What is the argument list for this built-in kernel? Is there any documentation?
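The only programmatic way I have found to poke at it (not a substitute for documentation) is to enumerate the built-in kernels, create the program from them and query clGetKernelArgInfo, which may or may not be supported for built-ins. A sketch, assuming 'context' and 'device' already exist:

    char names[2048] = "";
    clGetDeviceInfo(device, CL_DEVICE_BUILT_IN_KERNELS, sizeof(names), names, NULL);
    printf("built-in kernels: %s\n", names);      /* semicolon-separated list */

    cl_int err;
    cl_program prog = clCreateProgramWithBuiltInKernels(context, 1, &device,
                                                        "convolve_2d_intel", &err);
    cl_kernel k = clCreateKernel(prog, "convolve_2d_intel", &err);

    cl_uint nargs = 0;
    clGetKernelInfo(k, CL_KERNEL_NUM_ARGS, sizeof(nargs), &nargs, NULL);
    for (cl_uint i = 0; i < nargs; ++i) {
        char type_name[128] = "?", arg_name[128] = "?";
        /* may return CL_KERNEL_ARG_INFO_NOT_AVAILABLE, depending on the implementation */
        clGetKernelArgInfo(k, i, CL_KERNEL_ARG_TYPE_NAME, sizeof(type_name), type_name, NULL);
        clGetKernelArgInfo(k, i, CL_KERNEL_ARG_NAME,      sizeof(arg_name),  arg_name,  NULL);
        printf("arg %u: %s %s\n", i, type_name, arg_name);
    }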

Thanks,

Andrew
