Channel: Intel® Software - OpenCL*

Apparent memory leak and performance problems


Dear all,

My company is developing a scientific application with roughly 30,000 registered users. We have encountered some problems with the OpenCL support in Windows 8.1, using the latest driver 10.18.14.4080 and testing with the built-in GPU of a Core i7 4770; the application is 32-bit.

1) A memory-leak-like issue: The application has a non-performance-critical loop that looks roughly like this:

- Create kernels
for (i = 0; i < 100; i++)
{
    - Create OpenCL buffers
    - Run kernels
    - Free OpenCL buffers
}

Unfortunately, this loop only makes it to the third iteration; then the OpenCL buffer creation fails with CL_MEM_OBJECT_ALLOCATION_FAILURE. In the Windows Task Manager I can see the memory usage go through the roof.
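To make the allocation pattern concrete, this is roughly what one iteration of the loop boils down to (a stripped-down sketch, not our actual code; context, queue and kernel are created once before the loop, and the buffer size and work size are placeholders):

cl_int err;
cl_mem buf = clCreateBuffer(context, CL_MEM_READ_WRITE,
                            64 * 1024 * 1024, NULL, &err);   /* placeholder size */
if (err != CL_SUCCESS)
    return err;   /* fails here with CL_MEM_OBJECT_ALLOCATION_FAILURE from the 3rd iteration on */

clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
size_t global = 1024;   /* placeholder work size */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
clFinish(queue);

clReleaseMemObject(buf);   /* the buffer is released in every iteration */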

I am only reporting this because the same code runs fine on OpenCL devices from AMD or nVIDIA (where the Task Manager shows a constant memory footprint during all 100 iterations).

I also implemented a counter that I increment on each allocation (clCreateProgramWithSource, clCreateKernel, clCreateBuffer, clCreateImage) and decrement on each deallocation (clReleaseProgram, clReleaseKernel, clReleaseMemObject). This counter does not grow from one loop iteration to the next, so I am fairly sure the leak or fragmentation problem is not in my code.
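For reference, the bookkeeping behind that counter is nothing more than thin wrappers of this kind (illustrative names, not the real functions):

static long g_cl_objects = 0;   /* global object counter used for the leak check */

static cl_mem my_create_buffer(cl_context ctx, cl_mem_flags flags,
                               size_t size, void *host_ptr, cl_int *err)
{
    cl_mem m = clCreateBuffer(ctx, flags, size, host_ptr, err);
    if (m != NULL)
        g_cl_objects++;   /* count successful allocations */
    return m;
}

static void my_release_mem(cl_mem m)
{
    if (clReleaseMemObject(m) == CL_SUCCESS)
        g_cl_objects--;   /* count matching releases */
}

The counter returns to its starting value after every iteration, which is why I do not believe my code is holding on to objects.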

If your driver team wants to investigate this, I can of course send you the application for testing (I can't really extract a minimal code example).

2) A performance issue: For the two performance-critical parts of my application (not the loop above), a Radeon R9 290X executes the kernels 18x and 12x as fast as the HD Graphics 4600 in the Core i7 4770. But in 3DMark11 the Radeon is only about 3.5x as fast (http://www.futuremark.com/hardware/gpu/Intel+HD+Graphics+4600/review). So somewhere I am losing a massive amount of performance. This leads to three questions:

a) My code uses a lot of barrier(CLK_LOCAL_MEM_FENCE) calls after copying data from global memory or when sharing data between work items. On AMD and nVIDIA GPUs these barriers are optimized away by the compiler, because I declare the kernels with __attribute__((reqd_work_group_size(CL_WGSIZE,1,1))), where CL_WGSIZE matches the warp/wavefront size. The Intel compiler, however, seems to emit barrier code (the kernels break if I remove some of the barriers). Is there any way, or any work-group size, to help the compiler? I tried work-group sizes of 32 and 64, but no luck.
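For illustration, the kernels follow this general pattern (a simplified sketch, not one of the real kernels; CL_WGSIZE is a compile-time define passed to the compiler with -D):

__attribute__((reqd_work_group_size(CL_WGSIZE, 1, 1)))
__kernel void example_kernel(__global const float *in, __global float *out)
{
    __local float tile[CL_WGSIZE];
    const size_t lid = get_local_id(0);
    const size_t gid = get_global_id(0);

    tile[lid] = in[gid];                /* stage data in local memory */
    barrier(CLK_LOCAL_MEM_FENCE);       /* the barrier I would like to see optimized away */

    /* neighbouring work items read each other's staged values */
    out[gid] = tile[lid] + tile[(lid + 1) % CL_WGSIZE];
}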

b) On AMD and nVIDIA GPUs, I can obtain the assembler code generated for the kernels and check that everything has been translated as expected. Is there a way to see what the compiler did on Intel GPUs as well, so that I can identify bottlenecks?
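In case it clarifies what I mean: on AMD and nVIDIA, something along the lines of the standard CL_PROGRAM_BINARIES query after clBuildProgram already gives me usable output (a sketch for a single-device program, assuming the usual <stdio.h>, <stdlib.h> and CL/cl.h includes). My question is whether the binary returned by the Intel driver contains anything human-readable, or whether there is a better tool.

size_t bin_size = 0;
clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof(bin_size), &bin_size, NULL);

unsigned char *bin = (unsigned char *)malloc(bin_size);
unsigned char *bins[1] = { bin };
clGetProgramInfo(program, CL_PROGRAM_BINARIES, sizeof(bins), bins, NULL);

FILE *f = fopen("kernel.bin", "wb");   /* dump the device binary for inspection */
fwrite(bin, 1, bin_size, f);
fclose(f);
free(bin);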

c) What is the preferred, not excessively expensive way to identify bottlenecks? Our software development is done on Linux; the Windows executables are cross-compiled.

Many thanks for your help,
Elmar

 

