Channel: Intel® Software - OpenCL*

Apparent memory leak and performance problems


Dear all,

My company is developing a scientific application with roughly 30,000 registered users. We have encountered some problems with the OpenCL support in Windows 8.1, using the latest driver 10.18.14.4080 and testing with the built-in GPU of a Core i7 4770; the application is 32-bit.

1) A memory-leak-like issue: The application has a non-performance-critical loop that looks roughly like this:

- Create kernels
for (i = 0; i < 100; i++)
{
    - Create OpenCL buffers
    - Run kernels
    - Free OpenCL buffers
}

Unfortunately, this loop only makes it to the third iteration; then the OpenCL buffer creation fails with CL_MEM_OBJECT_ALLOCATION_FAILURE. In the Windows Task Manager I can see the memory usage go through the roof.
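To make the allocation pattern concrete, this is roughly what one iteration of the loop boils down to (a stripped-down sketch, not our actual code; context, queue and kernel are created once before the loop, and the buffer size and work size are placeholders):

cl_int err;
cl_mem buf = clCreateBuffer(context, CL_MEM_READ_WRITE,
                            64 * 1024 * 1024, NULL, &err);   /* placeholder size */
if (err != CL_SUCCESS)
    return err;   /* fails here with CL_MEM_OBJECT_ALLOCATION_FAILURE from the 3rd iteration on */

clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
size_t global = 1024;   /* placeholder work size */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
clFinish(queue);

clReleaseMemObject(buf);   /* the buffer is released in every iteration */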

I am only reporting this because the same code runs fine on OpenCL devices from AMD or nVIDIA (where the Task Manager shows a constant memory footprint during all 100 iterations).

I also implemented a counter that I increment on each allocation (clCreateProgramWithSource, clCreateKernel, clCreateBuffer, clCreateImage) and decrement on each deallocation (clReleaseProgram, clReleaseKernel, clReleaseMemObject). This counter does not grow from one loop iteration to the next, so I am fairly sure the leak or fragmentation problem is not in my code.
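For reference, the bookkeeping behind that counter is nothing more than thin wrappers of this kind (illustrative names, not the real functions):

static long g_cl_objects = 0;   /* global object counter used for the leak check */

static cl_mem my_create_buffer(cl_context ctx, cl_mem_flags flags,
                               size_t size, void *host_ptr, cl_int *err)
{
    cl_mem m = clCreateBuffer(ctx, flags, size, host_ptr, err);
    if (m != NULL)
        g_cl_objects++;   /* count successful allocations */
    return m;
}

static void my_release_mem(cl_mem m)
{
    if (clReleaseMemObject(m) == CL_SUCCESS)
        g_cl_objects--;   /* count matching releases */
}

The counter returns to its starting value after every iteration, which is why I do not believe my code is holding on to objects.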

If your driver team wants to investigate this, I can of course send you the application for testing (I can't really extract a minimal code example).

2) A performance issue: For the two performance-critical parts of my application (not the loop above), a Radeon R9 290X executes the kernels 18x and 12x as fast as the HD Graphics 4600 in the Core i7 4770. But in 3DMark11 the Radeon is only about 3.5x as fast (http://www.futuremark.com/hardware/gpu/Intel+HD+Graphics+4600/review). So somewhere I am losing a massive amount of performance. This leads to three questions:

a) My code uses a lot of barrier(CLK_LOCAL_MEM_FENCE) calls after copying data from global memory or when sharing data between work items. On AMD and nVIDIA GPUs these barriers are optimized away by the compiler, because I declare the kernels with __attribute__((reqd_work_group_size(CL_WGSIZE,1,1))), where CL_WGSIZE matches the warp/wavefront size. The Intel compiler, however, seems to emit barrier code (the kernels break if I remove some of the barriers). Is there any way, or any work-group size, to help the compiler? I tried work-group sizes of 32 and 64, but no luck.
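For illustration, the kernels follow this general pattern (a simplified sketch, not one of the real kernels; CL_WGSIZE is a compile-time define passed to the compiler with -D):

__attribute__((reqd_work_group_size(CL_WGSIZE, 1, 1)))
__kernel void example_kernel(__global const float *in, __global float *out)
{
    __local float tile[CL_WGSIZE];
    const size_t lid = get_local_id(0);
    const size_t gid = get_global_id(0);

    tile[lid] = in[gid];                /* stage data in local memory */
    barrier(CLK_LOCAL_MEM_FENCE);       /* the barrier I would like to see optimized away */

    /* neighbouring work items read each other's staged values */
    out[gid] = tile[lid] + tile[(lid + 1) % CL_WGSIZE];
}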

b) On AMD and nVIDIA GPUs, I can obtain the assembler code generated for the kernels and check that everything has been translated as expected. Is there a way to see what the compiler did on Intel GPUs as well, so that I can identify bottlenecks?
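In case it clarifies what I mean: on AMD and nVIDIA, something along the lines of the standard CL_PROGRAM_BINARIES query after clBuildProgram already gives me usable output (a sketch for a single-device program, assuming the usual <stdio.h>, <stdlib.h> and CL/cl.h includes). My question is whether the binary returned by the Intel driver contains anything human-readable, or whether there is a better tool.

size_t bin_size = 0;
clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof(bin_size), &bin_size, NULL);

unsigned char *bin = (unsigned char *)malloc(bin_size);
unsigned char *bins[1] = { bin };
clGetProgramInfo(program, CL_PROGRAM_BINARIES, sizeof(bins), bins, NULL);

FILE *f = fopen("kernel.bin", "wb");   /* dump the device binary for inspection */
fwrite(bin, 1, bin_size, f);
fclose(f);
free(bin);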

c) What is the preferred, not excessively expensive way to identify bottlenecks? Our software development is done on Linux; the Windows executables are cross-compiled.

Many thanks for your help,
Elmar

 

