Sandy Bridge, INDE OK while IOC64 and runtime fails (W8.1 -I7 2760)

May 29, 2015, 3:10 am

Latest and popular articles on Intel Technologies

Hi,

I have a piece of code that runs fine on Ivy Bridge and later CPUs. On Sandy Bridge (2760) it will not vectorize, and hence it does not perform.

Now, the same code will compile and run if we use your competitor's runtime.

Our problem is that the INDE compiler will vectorize it but when we use the IOC64 directly it simply says compile failure. Is it safe to assume that the toolkit is using the same compiler?

Further, the -Scholar mode does not work?

Any ideas?

Geir

↧

Problems with reduction done in CPU

May 29, 2015, 11:29 am

Hi all. I have been trying to code reductions for CPU and GPU. The kernels below work really well on Intel and Nvidia GPUs, but when I compile for the (Intel) CPU the results are not consistent: sometimes the result is right, sometimes it is wrong. There are two kernels. reduction_vector is called repeatedly by the host; once the global size has been reduced to the local size, I issue reduction_complete to finalize the reduction.

__kernel void reduction_vector(__global int* data, __local int* partial_sums)
{
    int lid = get_local_id(0);
    int group_size = get_local_size(0);

    /* Each work-item copies one element into local memory. */
    partial_sums[lid] = data[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Tree reduction with a doubling stride (group_size must be a power of two). */
    for (int i = 1; i < group_size; i <<= 1) {
        int mask = (i << 1) - 1;
        if ((lid & mask) == 0) {
            partial_sums[lid] += partial_sums[lid + i];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    /* Work-item 0 writes the group's partial sum back into data. */
    if (lid == 0) {
        data[get_group_id(0)] = partial_sums[0];
    }
}

__kernel void reduction_complete(__global int* data,
                                 __local int* partial_sums, __global int* sum)
{
    int lid = get_local_id(0);
    int group_size = get_local_size(0);

    partial_sums[lid] = data[get_local_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);

    for (int i = 1; i < group_size; i <<= 1) {
        int mask = (i << 1) - 1;
        if ((lid & mask) == 0) {
            partial_sums[lid] += partial_sums[lid + i];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    /* Work-item 0 writes the final sum. */
    if (lid == 0) {
        *sum = partial_sums[0];
    }
}

This is the host code:

local_size = 128;

/* Create data buffer */
data_buffer = clCreateBuffer(oclobjects.context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, sizeof(cl_int) * ARRAY_SIZE, data, &err);
sum_buffer = clCreateBuffer(oclobjects.context, CL_MEM_WRITE_ONLY, sizeof(cl_int), NULL, &err);
if(err < 0) {
    perror("Couldn't create a buffer");
    exit(1);
}

clEnqueueWriteBuffer(oclobjects.queue, data_buffer, CL_TRUE, 0, sizeof(cl_int) * ARRAY_SIZE, data, 0, NULL, NULL);
clFinish(oclobjects.queue);

/* Set arguments for vector kernel */
err = clSetKernelArg(vector_kernel, 0, sizeof(cl_mem), &data_buffer);
err |= clSetKernelArg(vector_kernel, 1, local_size * sizeof(cl_int), NULL);

/* Set arguments for complete kernel */
err |= clSetKernelArg(complete_kernel, 0, sizeof(cl_mem), &data_buffer);
err |= clSetKernelArg(complete_kernel, 1, local_size * sizeof(cl_int), NULL);
err |= clSetKernelArg(complete_kernel, 2, sizeof(cl_mem), &sum_buffer);
if(err < 0) {
    perror("Couldn't create a kernel argument");
    exit(1);
}

/* Enqueue kernels */
global_size = ARRAY_SIZE;
err = clEnqueueNDRangeKernel(oclobjects.queue, vector_kernel, 1, NULL, &global_size,
                             &local_size, 0, NULL, NULL);
if(err < 0) {
    perror("Couldn't enqueue the kernel");
    exit(1);
}
printf("Global size = %zu\n", global_size);

/* Perform successive stages of the reduction */
while(global_size/local_size > local_size) {
    global_size = global_size/local_size;
    err = clEnqueueNDRangeKernel(oclobjects.queue, vector_kernel, 1, NULL, &global_size,
                                 &local_size, 0, NULL, NULL);
    printf("Global size = %zu\n", global_size);
    if(err < 0) {
        perror("Couldn't enqueue the kernel");
        exit(1);
    }
}

/* Final stage: a single work-group finishes the reduction */
global_size = global_size/local_size;
local_size = global_size;
err = clEnqueueNDRangeKernel(oclobjects.queue, complete_kernel, 1, NULL, &global_size,
                             &local_size, 0, NULL, NULL);
printf("Global size = %zu\n", global_size);

/* Read the result */
err = clEnqueueReadBuffer(oclobjects.queue, sum_buffer, CL_TRUE, 0, sizeof(cl_int), &sum, 0, NULL, NULL);
clFinish(oclobjects.queue);
if (err < 0) {
    perror("Couldn't read the buffer");
    exit(1);
}

/* Finish processing the queue and get profiling information */
clFinish(oclobjects.queue);

It looks to me like this is a bug in Intel's CPU runtime. Note that I tried two runtimes:

1. Runtime 14.2 x64

2. Runtime 15.1 x64

Thanks for your help.

Diego
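
A serial reference can help tell a runtime bug from a data race in the kernel. The sketch below is plain C, not OpenCL, and the helper name reduce_pass is hypothetical. It models one pass of the same stride-doubling tree, but writes each group's total to a separate output buffer. That differs from reduction_vector above, which reads data[get_global_id(0)] and writes data[get_group_id(0)] in the same buffer; since OpenCL does not order work-groups relative to each other, one group may overwrite an element that another group has not read yet, which could explain results that vary from run to run.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Serial model of one reduction pass (hypothetical helper, not OpenCL).
   Each "work-group" of group_size consecutive elements of in[] is summed
   with the same stride-doubling tree as the kernel, and the group total is
   written to out[group], a buffer separate from in[].
   Assumes group_size is a power of two and n is a multiple of group_size.
   Returns 0 on success, -1 if the scratch allocation fails. */
static int reduce_pass(const int *in, int *out, size_t n, size_t group_size)
{
    int *partial_sums = malloc(group_size * sizeof *partial_sums);
    if (partial_sums == NULL) {
        return -1;
    }
    for (size_t group = 0; group < n / group_size; ++group) {
        /* Load step: one element per work-item. */
        for (size_t lid = 0; lid < group_size; ++lid) {
            partial_sums[lid] = in[group * group_size + lid];
        }
        /* Tree step: the end of each inner loop plays the role of barrier(). */
        for (size_t i = 1; i < group_size; i <<= 1) {
            size_t mask = (i << 1) - 1;
            for (size_t lid = 0; lid < group_size; ++lid) {
                if ((lid & mask) == 0) {
                    partial_sums[lid] += partial_sums[lid + i];
                }
            }
        }
        out[group] = partial_sums[0];
    }
    free(partial_sums);
    return 0;
}
```

Calling reduce_pass repeatedly, swapping the roles of the two buffers between passes, gives a value to compare against the device result. On the device, the same idea would mean passing distinct input and output buffers to the kernel.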

↧

Can INT16 achieve 2x throughput compared with INT32 on Broadwell?

June 1, 2015, 10:23 pm

Hi,

According to Gen8.pdf,

'These units can SIMD execute up to four 32-bit floating-point (or integer) operations, or SIMD execute up to eight 16-bit integer or 16-bit floating-point operations.'

It seems that INT16 can achieve 2x peak throughput compared with INT32.

In Gen8.pdf, the table shows that for HD Graphics 5300, 32-bit integer throughput is 192 IOP/cyc. Does that mean 16-bit integer throughput is 192*2 IOP/cyc?

Is my understanding right?

I am asking because in my tests I see nowhere near a 2x throughput increase when using the 'short' data type. I get almost the same performance when I switch from 'int' to 'short'.

Any comments?
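
For reference, the arithmetic in the question can be written out as below. This is only a sketch that scales the quoted 32-bit figure by the ratio of SIMD lanes (4 lanes at 32 bits, 8 lanes at 16 bits, both taken from the Gen8.pdf quote). The function name is hypothetical, and the result is a theoretical peak, not a measured rate.

```c
#include <assert.h>

/* Scale a quoted 32-bit peak rate (operations per cycle) by the ratio of
   SIMD lanes available at the narrower type. Hypothetical helper. */
static unsigned peak_ops_per_cycle(unsigned rate_32bit,
                                   unsigned lanes_32bit,
                                   unsigned lanes_narrow)
{
    return rate_32bit * lanes_narrow / lanes_32bit;
}
```

With the quoted values, peak_ops_per_cycle(192, 4, 8) gives 384 IOP/cyc for 16-bit integers.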

↧

Multithreading enqueue commands in Intel MIC OpenCL

June 2, 2015, 1:45 pm

Hi all, I encountered some very weird behavior. I ran the following code:

t0 = get_time();
clEnqueueWriteBuffer(queue, mem, CL_FALSE, 0, (size_t)(1.8 * GB), host, 0, NULL, NULL);
printf("%lf secs\n", get_time() - t0);

The evaluation system has 4 Intel Xeon Phi 5110P coprocessors (with Intel OpenCL runtime 14.2 and MPSS 3.4.2).
When I ran the code using MPI with 4 MPI tasks, each task reported about 0.0000x secs.
But when I ran the code using threads, such as 4 OpenMP threads, the call took about 5 secs, even though it only enqueues a non-blocking command.
Do you have any idea why?

Thanks.
Jungwon

↧

What is "Graphics Video Max Memory" and its impact on OpenCL?

June 2, 2015, 9:10 pm

I noticed that the desktop and server Broadwell CPUs announced today have very different "Graphics Video Max Memory" attributes.

Can someone from Intel explain how to interpret the "Graphics Video Max Memory" attribute and whether it impacts the HD Graphics OpenCL environment?

Is this just a limit for rendering surfaces addressable by QuickSync and the IGP?

Here's the comparison on ARK:

↧

vstore4 cannot write data to global address space on Intel HD4600 GPU in OpenCL kernel

June 3, 2015, 11:18 am

The code is simple:

typedef struct _Class {
ulong vtable;
ulong id;
} Class;
//code below is in kernel function
__global Class* psrc = (__global Class*) param1; //param1 is any valid __global kernel pointer parameter (length >= sizeof(Class))
__global Class* pdest = (__global Class*) param2; //param2 is any valid __global kernel pointer parameter (length >= sizeof(Class))
uint4 ui4 = vload4(0, (__global uint*) psrc);
vstore4(ui4, 0, (__global uint*) pdest);
printf("%#v4hlX vtable=%ld\n", ui4, pdest->vtable);

The result shows that on the HD4600 GPU the data at pdest (param2) is not overwritten with the data from psrc (param1), while the same code works fine on AMD and Nvidia GPUs.

I also found that the code below, which uses private source and destination objects instead of __global kernel parameters, works on HD4600:
Class src = {3333, 200};
Class dest = {2222, 100};
uint4 ui4 = vload4(0, (__global uint*) &src);
vstore4(ui4, 0, (__global uint*) &dest);
printf("%#v4hlX vtable=%ld\n", ui4, dest.vtable);

Is it a bug? Any comments are appreciated.

↧

MSS installation error

June 3, 2015, 12:20 pm

Hi,

I ran into an error when trying to install MSS 2015 R4 Pro Edition. The error message is:

Failed to process component: Graphics Driver

.......

My system is an Intel NUC with an i5-5300U and Intel HD Graphics 5500, running Windows 7.

I can see the Graphics 5500 is not on the supported list. If I continue with the installation, what features and capabilities am I giving up? What choices do I have?

I want to take advantage of the encode/decode accelerator and determine how many encode/decode threads (1080p signal) it can support.

Thanks

↧

Code Builder 2015 (Ubuntu Linux)

June 3, 2015, 1:34 pm

Hi Folks.

I installed Intel Code Builder on my Ubuntu 14.04. Installation was successful, as reported by installation script.

Now, how can I use it? I cannot find any executable for it in my binary directories, and the Code Builder manual DOES NOT mention how to launch it. The same goes for its release notes.

I would be perfectly satisfied even using just the runtime included in Code Builder, that is, writing OpenCL code with a text editor and running it on my Xeon E3-1246v3 integrated GPU (no specific need for an IDE).

Any help would be greatly appreciated. Thanks.

↧

How to integrate VLC with MSS QuickSync video

June 3, 2015, 3:53 pm

Hi,
I have MSS Pro installed on my system. I would like to run VLC using the MSS video accelerator to benchmark some video apps. How do I know whether VLC is using the MSS QuickSync accelerator? If it is not, how do I configure VLC to use it?
Thanks

My platform is a NUC with an i5-5300U and HD Graphics 5500, running Windows 7.

↧

READ ME FIRST: how to get a fast response to your OpenCL questions?

June 4, 2015, 2:24 pm

Dear Customers,

If you have an issue with Intel OpenCL tools or drivers, please follow this check list when reporting it on the forum to ensure fast service:

- Please let us know what Processor, Operating System, Graphics Driver Version, and Tool Version you are using.
- Please state the steps to reproduce the issue as precisely as you can.
- If you are using command line tools, please provide the full command line.
- If code is involved, it is great to create a small "Reproducer" sample and attach it to the message in the form of a zip file.
- If you don't want your code to be seen by other forum users, please send me a private message.
- Before posting, search the forum to see if someone already answered a similar question.

Thank you!

↧

Questions about Intel's opencl platform

June 5, 2015, 4:47 pm

I'm new to Intel's OpenCL platform, but have read good things about its performance.

I need to know whether my system is compatible with your platform before venturing in too deep.

I use Ubuntu 14.04 x64. My card is a Radeon Curacao Pro R9 270 2 GB, and my CPU is an AMD FX 8320 8-core with 8 GB memory. Unfortunately, I cannot find a working OpenCL debugger from AMD (CodeXL 1.7 doesn't work and I cannot find earlier releases). So I was wondering whether your platform could generate code for this setup, and whether Code Builder could debug it.

↧

OpenCL will not install!!!

June 8, 2015, 11:15 pm

So I downloaded the OpenCL driver for Windows 7 64-bit for my i7-4790K with HD 4600 Graphics. When I go to install it, a little window pops up and says my PC does not meet the minimum requirements.

Specs:

System CPU : I7-4790K @ Stock, CPU Cooler: Noctua NH-D15, Motherboard: Asus Sabertooth Mark 2, RAM: 16 GB G.Skill Sniper Series @ 1866MHz, GPU: GeForce GTX 980 ACX 2.0, Storage: Samsung 120GB 840 EVO, WD Cavier Black 1TB, Psu: Corsair HX 750w, Case: Fractal Design r4 Black Pearl Window, OS: Windows 7 64bit Home Premium

↧

How to use Pipe in OpenCL2.0

June 10, 2015, 8:51 am

Hi guys!
I am trying to use pipes in OpenCL 2.0, but I am confused about the built-in pipe read and write functions, such as write_pipe and reserve_read_pipe.

Code :

reserve_id_t reserve_write_pipe (pipe gentype p,
                                 uint num_packets)

Code :

int write_pipe (pipe gentype p,
                reserve_id_t reserve_id,
                uint index,
                const gentype *ptr)

1. For the reserve_write_pipe() function, the description is:

"Reserve num_packets entries for writing to pipe p. Returns a valid reservation ID if the reservation is successful."

I don't know what the num_packets argument means.
2. For the write_pipe() function, the description is:

"Write packet specified by ptr to the reserved area of the pipe referred to by reserve_id and index."

What does "reserved area of the pipe" mean? And what do "reserve_id" and "index" mean?
3. If I want to write to the pipe in order (for example, work-item 0 writes 0, work-item 1 writes 1, ..., work-item n writes n, so that the data in the pipe is "012345...n" with 0 first), what should I do?
Can someone help me please? Thanks.
hi_buddy
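
One way to picture the three terms is a small host-side model. The sketch below is plain C, not OpenCL, and every name in it (model_pipe, model_reserve_write, model_write, model_ordered_write, the capacity of 16) is hypothetical. It assumes that a reservation is a block of num_packets consecutive slots, that reserve_id identifies that block, and that index selects one slot inside it; a real runtime may implement pipes differently.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PIPE_CAPACITY 16                  /* hypothetical capacity, in packets */
#define MODEL_INVALID_RESERVATION ((size_t)-1)  /* stands in for a failed reservation */

/* Hypothetical host-side model of a pipe that carries int packets. */
typedef struct {
    int    packets[MODEL_PIPE_CAPACITY];
    size_t reserved;  /* number of slots handed out so far */
} model_pipe;

/* Models reserve_write_pipe: claim num_packets consecutive slots.
   The return value plays the role of reserve_id (here, the first slot). */
static size_t model_reserve_write(model_pipe *p, size_t num_packets)
{
    if (num_packets == 0 || num_packets > MODEL_PIPE_CAPACITY - p->reserved) {
        return MODEL_INVALID_RESERVATION;
    }
    size_t first_slot = p->reserved;
    p->reserved += num_packets;
    return first_slot;
}

/* Models write_pipe: store *ptr in slot `index` of the reserved area.
   Returns 0 on success and a negative value otherwise. This simplified
   model only checks that the slot lies within the reserved region. */
static int model_write(model_pipe *p, size_t reserve_id, size_t index, const int *ptr)
{
    if (reserve_id == MODEL_INVALID_RESERVATION ||
        reserve_id > p->reserved ||
        index >= p->reserved - reserve_id) {
        return -1;
    }
    p->packets[reserve_id + index] = *ptr;
    return 0;
}

/* Question 3 in this model: one reservation for the whole group, then each
   "work-item" wi writes its own value at index wi. */
static int model_ordered_write(model_pipe *p, size_t num_work_items)
{
    size_t reserve_id = model_reserve_write(p, num_work_items);
    if (reserve_id == MODEL_INVALID_RESERVATION) {
        return -1;
    }
    for (size_t wi = 0; wi < num_work_items; ++wi) {
        int value = (int)wi;
        if (model_write(p, reserve_id, wi, &value) != 0) {
            return -1;
        }
    }
    return 0;
}
```

In OpenCL 2.0 the closest counterpart to model_ordered_write would probably be work_group_reserve_write_pipe followed by write_pipe with get_local_id(0) as the index. That orders packets within one reservation only; the order between different work-groups is a separate matter.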

↧

Running OpenCL kernel (s) on CPU and GPU

June 11, 2015, 1:19 pm

Hi,

We have installed the Intel Media Server Studio Essential 2015 R5 toolkit. We have CentOS 7.1 running on an Intel i7-4790K processor with HD4600 graphics.

I want to confirm/ask the following points related to running OpenCL programs with the new OpenCL driver:

1) Although the website (https://software.intel.com/en-us/articles/opencl-drivers , 3rd option) says that the OpenCL driver on Linux OS supports only the GPU, I am able to run programs on both the CPU and the GPU by using the CL_DEVICE_TYPE_{GPU/CPU} options. I am wondering if there is any limitation that I should be aware of while running a program on the CPU. Please confirm.

2) Is it possible to execute multiple kernels at the same time on the GPU? More specifically, I would like to know about the scheduling behavior in the following cases:

(a) When multiple kernels (with no dependencies among them) are launched from an application through multiple command-queues: Do these kernels run concurrently, say when a particular kernel does not use all available cores of the GPU? I know that there are four time stamps in a kernel execution: enqueue, submit, start, end. Can multiple kernels have an overlap between start and end? I think device fission is one way of dividing the cores into different groups, but I am wondering if the GPU hardware has an internal mechanism to run multiple kernels without device fission.

(b) When multiple kernels are launched from DIFFERENT applications: Here, different kernels belong to different programs. How does the OpenCL driver schedule kernels in this case? Can multiple kernels "run" concurrently?

Thanks,

Kapil
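
For case (a), the start and end timestamps mentioned above can be compared directly. The sketch below is plain C with hypothetical names (kernel_span, overlap_ns). It assumes the two pairs of values were read with clGetEventProfilingInfo (CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END) from events on the same device, with both command-queues created with CL_QUEUE_PROFILING_ENABLE.

```c
#include <assert.h>
#include <stdint.h>

/* Start and end device timestamps, in nanoseconds, for one kernel execution.
   Hypothetical type; the values are assumed to come from
   clGetEventProfilingInfo on events from the same device. */
typedef struct {
    uint64_t start_ns;
    uint64_t end_ns;
} kernel_span;

/* Returns how long the two executions overlapped, or 0 if they did not. */
static uint64_t overlap_ns(kernel_span a, kernel_span b)
{
    uint64_t latest_start = a.start_ns > b.start_ns ? a.start_ns : b.start_ns;
    uint64_t earliest_end = a.end_ns < b.end_ns ? a.end_ns : b.end_ns;
    return earliest_end > latest_start ? earliest_end - latest_start : 0;
}
```

A non-zero result for two independent kernels would suggest that their start-to-end intervals overlapped; a result of 0 on every run would suggest they were serialized.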

↧

Only OpenCL 1.2 on Intel media server studio

June 11, 2015, 8:07 pm

Hello everyone. I have just started to program OpenCL with the Intel SDK on my laptop. (I originally used AMD's on my desktop.)
This page states that Intel Code Builder supports OpenCL 2.0:
https://software.intel.com/en-us/opencl-code-builder

However, after downloading the R5 community version for Linux and launching clinfo, it tells me that only OpenCL 1.2 is available for the CPU and GPU.
Why does this happen, and how can I fix it?
Note: I know that Nvidia doesn't support OpenCL 2.0 yet, but I want to do some work with the integrated GPU first.

OS : Ubuntu 15.04 64bit
Intel media server studio : community 2015 r5 (also Nvidia's OpenCL, installed from xorg-edgers)
On chip GPU : HD 4600
CPU : I5 4210H
External GPU : gtx860m
Nvidia driver : 349

↧

Local Memory read/writes fail on Surface Pro 3 i5 GPU

June 12, 2015, 11:16 am

Hi,

I would like to report a problem I discovered when running clFFT on Surface Pro i5 GPU Driver Version 10.18.14.4222 Driver Date 5/22/2015. After some tedious investigation I think I could boil the problem down to local memory read/write synchronization failures if specific memory access patterns are used.

Tested devices where the problem does not exist (I ran the same kernel on those devices): Surface 3 (non pro) GPU and CPU, MacBookPro 11,3 dedicated Nvidia GPU, Surface Pro i5 CPU. The problem only exists on Surface Pro i5 GPU.

I determined that the local work group collaborates on filling local memory with data. Then the local work group synchronizes using a barrier. Then the local work group collaborates on reading data from local memory in a different pattern than the pattern that was used to write the data. A higher level explanation of this operation is: work items within a local work group interchange intermediate results through shared memory.

I determined that this operation is faulty. The behaviour observed is similar to behaviour one would expect if reads and writes were not separated by a barrier. A small modification of the memory access pattern solved this problem: when adding +1 to each read and write from and to local memory, the kernel behaves correctly and no error is observed. The following table shows the expected output on the left generated by the CPU, on the right the incorrect output is illustrated.

This https://gist.github.com/sschaetz/f37e15ec2f059e13777b contains a reduced reproducer code that fails. It contains the offending kernel in kstr and runs the same kernel on OpenCL platform 0 device 0 and device 1 and compares the output. On Surface Pro 3 i5 this corresponds to GPU and CPU. As is, the code should produce the error. If the first line in the kernel string is commented, and the second line is uncommented, the error should disappear. It requires pyOpenCL to work.

The incorrect output is:

The expected output is:

Based on this I conjecture that either the OpenCL compiler determines the barrier shielding overlapping reads from writes can be omitted or the GPU does not execute the barrier correctly.

Best,

Sebastian

↧

Can Intel i7 3rd or 4th generation run OpenCL on GPU processor graphics

June 13, 2015, 10:24 am

Hi,

I'm using 3rd- and 4th-generation Intel i7 processors.

(I'm using Linux)

I don't have an external GPU, so I intend to use the GPU on the processor.

-- Can I run OpenCL code on the processor's GPU?

-- If not, which programming language can I use to run code on the processor's GPU?

Thanks

↧

Eclipse Luna does not load after linking with OpenCL Code Builder

June 14, 2015, 9:04 am

I am using CentOS 7 to run OpenCL Code Builder. I have already downloaded and installed Intel Media Server Studio for CentOS 7. To start programming with OpenCL, I use Eclipse Luna (Eclipse for Parallel Application Developers version), and the Eclipse IDE loads properly. To link it with Code Builder, I followed the discussion here. The instructions are:

To enable the OpenCL™ API Offline Compiler plug-in for Eclipse* IDE, do the following:

1. Copy the plug-in *.jar file from $(INTELOCLSDKROOT)\bin\eclipse-plug-in to $(ECLIPSEROOT)\dropins.
2. On Linux* OS add $(INTELOCLSDKROOT)\bin to LD_LIBRARY_PATH.
3. Run Eclipse IDE.
4. Select Window > Preferences.
5. Switch to the Intel OpenCL dialog and set OpenCL binary directory to $(INTELOCLSDKROOT)\bin\.

I have already finished steps 1 and 2. But when I run the Eclipse IDE, I get an error and the IDE stops loading halfway. The error is "Cannot get machine list. Could not load required libraries. Please make sure to set the correct path under the Code Builder for Opencl preference page". What causes this error? Thanks

↧

V-Ray fails to compile for Intel HD 4400 and probably any iGPU due to common runtime.

June 15, 2015, 3:28 am

Reading the V-Ray output log, it is clear that it fails to compile the raytracer for the Intel HD 4400 OpenCL runtime. This always happens, regardless of the scene's complexity.

Qt: Untested Windows version 6.3 detected!
Successfully initialized spawner. Waiting for jobs...
Received TM_START_RENDER from 127.0.0.1
Starting DR session from 127.0.0.1
Receiving DR scene from 127.0.0.1
[DR] getting sequence data...
[DR] Calling beginSeqeunce()
Preparing renderer...
Preparing scene for rendering...
OpenCL renderer requested.
loadLibrary(C:\ProgramData\ASGVIS\Common\x64\vc10\Distributed Rendering\rt_opencl.dll)
Found plugin "RTOpenCL"
Plugin library "C:\ProgramData\ASGVIS\Common\x64\vc10\Distributed Rendering\rt_opencl.dll" loaded.
1 plugin(s) loaded successfully
OpenCL renderer plugin successfully loaded from "C:\ProgramData\ASGVIS\Common\x64\vc10\Distributed Rendering\rt_opencl.dll"
OpenCL renderer plugin instance successfully created.
EXT_RTOPENCL interface obtained successfully from OpenCL renderer plugin instance.
[DR] getting frame data...
[DR] frame number is 0
[DR] Calling beginFrame()
Preparing camera sampler.
Preparing scene for frame...
Compiling geometry...
Preparing ray server.
        Building SDTree for GPU
        Building SDTree for GPU
        Scene is empty.
Preparing direct light manager.
Preparing global light manager.
[DR] Calling renderImage()
Running RTEngine
Initializing OpenCL renderer (single kernel version)...
Querying for OpenCL devices...
Environment variable VRAY_OPENCL_PLATFORMS_x64 not found - using all available devices
Found 2 OpenCL platforms
Using the following OpenCL devices:
Intel(R) OpenCL Intel(R) Core(TM) i7-4500U CPU @ 1.80GHz
Intel(R) OpenCL Intel(R) HD Graphics 4400
NVIDIA CUDA GeForce GT 750M
CPU TUPLE SIZE = 2
Using memory buffers for textures
Using global ray states
cl_nv_compiler_options extension not found.
Building OpenCL trace program for device Intel(R) OpenCL_Intel(R) HD Graphics 4400...
Error Program build failure (-11) at line 1467 , in file ./src/opencl_main.cpp !!!

Failed to compile OpenCL kernels, falling back to CPU code.
buildProgram() failed for device 0
initDevices() failed.
Number of lights: 0
Number of area lights: 0
Number of moving area lights: 0
Total number of lights added by updateLights(): 0
Setting up 4 thread(s)
        Number of raycasts: 0
         Camera rays: 0
         Shadow rays: 0
         GI rays: 0
         Reflection rays: 0
         Refraction rays: 0
         Unshaded rays: 0
        Clearing global light manager.
        Clearing direct light manager.
        Clearing ray server.
        Clearing geometry.
        Number of intersectable primitives: 0
         SD triangles: 0
         MB triangles: 0
         Static primitives: 0
         Moving primitives: 0
         Infinite primitives: 0
        Clearing camera image sampler.
        Clearing camera sampler.
        Clearing DMC sampler.
        Clearing path sampler.
        Clearing color mapper.
        Scene constructed in 4.9 seconds
        Preparing renderer...
        Preparing scene for rendering...
        [RenderView] startCameraTime=0.000000, endCameraTime=1.033333
        [RenderView] numCameraTMs=62, numFrames=31, frameSamples=2
        OpenCL renderer requested.
        OpenCL renderer plugin already loaded.
        OpenCL renderer plugin instance successfully created.
        EXT_RTOPENCL interface obtained successfully from OpenCL renderer plugin instance.
        [RenderView] startCameraTime=0.000000, endCameraTime=1.033333
        [RenderView] numCameraTMs=62, numFrames=31, frameSamples=2
        Preparing camera sampler.
        Preparing scene for frame...
        Compiling geometry...
        Preparing ray server.
                Building SDTree for GPU
                Building SDTree for GPU
                Scene bounding box is [-28.7711,-29.052,3.27523]-[28.1591,30.7112,197.953]
        Preparing direct light manager.
        Preparing global light manager.
        Number of lights: 2
        Number of area lights: 0
        Number of moving area lights: 0
        Total number of lights added by updateLights(): 1
Writing crash dump to "C:\Users\(*)\AppData\Local\Temp\VRay.dmp"

↧

Out of memory when compiling 'big' kernel for HD4600 GPU

June 15, 2015, 9:24 am

In fact my kernel file is not big. It contains about 400 lines of code: 10 functions and one kernel function. Some code in the kernel function was commented out. Gradually uncommenting the code, I found that when the total amount of code exceeds some threshold, clBuildProgram fails with this error:

Error in clBuildProgram (-11): CL_BUILD_PROGRAM_FAILURE
fcl build 1 succeeded.
fcl build 2 succeeded.
Error: out of memory.

As far as I know, the Intel Core GPU does not have its own memory; it uses CPU memory. The physical memory is very big. A larger amount of code compiles successfully on an AMD HD5870 discrete GPU card, which has only 1 GB of memory.

So, is there a solution to the compilation OOM issue?
The CPU is an i7-4710MQ, running Windows 10.

↧