But it succeeds in exe or static library projects.
Could you help? Thanks.
Hello everyone!
I'm running into a problem where data is not being written to my buffer when the kernels finish. I've tested my kernel in isolation in Eclipse running on Ubuntu on an Intel i5 CPU and it seems to output the correct results. When I move it over to CentOS I can't get any printf output from the kernel and my output buffers are never written to. Here is an example of my code:
double * coef_elts = (double *) calloc(p * voxels, sizeof(double));
return_vec_1 = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, sizeof(double) * p * voxels, coef_elts, &err);
err = clSetKernelArg(kernel, 26, sizeof(cl_mem), &return_vec_1);
err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL, 0, NULL, NULL);
err = clEnqueueReadBuffer(queue, return_vec_1, CL_TRUE, 0, sizeof(double) * p * voxels, coef_elts, 0, NULL, NULL);
When I read the output it only contains the zeros assigned by calloc. This wasn't the case in Eclipse. If anyone has any suggestions on the code or on getting an output in CentOS it would be much appreciated. I am aware CentOS is not supported but unfortunately I cannot change the OS.
Thank you!
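In the snippet above each call overwrites `err` without inspecting it, so a failure in `clCreateBuffer`, `clSetKernelArg`, or either enqueue call would go unnoticed. A minimal sketch of a check helper follows; it is plain C, relies only on the fact that `CL_SUCCESS` is 0, and the names `check_status` and `STATUS_OK` are hypothetical.

```c
#include <stdio.h>

/* Stand-in for CL_SUCCESS, which the OpenCL headers define as 0. */
#define STATUS_OK 0

/* Returns 1 when `status` signals success; otherwise prints the failing
   step and its numeric code to stderr and returns 0. */
static int check_status(int status, const char *step)
{
    if (status == STATUS_OK)
        return 1;
    fprintf(stderr, "%s failed with error %d\n", step, status);
    return 0;
}
```

A call site could then read `if (!check_status(err, "clEnqueueNDRangeKernel")) return 1;` after each API call, which would show which step, if any, fails on the CentOS machine.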
In certain cases clEnqueueReadBuffer doesn't transfer all the required data when executed on HD4600. System: Win7 x64, driver version 15.36.19.64.4170, 32-bit application.
It seems that when the destination buffer is page-aligned and the transfer length is not a multiple of 4 KB, only the largest multiple of 4 KB that fits is transferred. Sample code:
size_t length = 76800;
char *dst = VirtualAlloc(NULL, length, MEM_RESERVE|MEM_COMMIT|MEM_TOP_DOWN, PAGE_READWRITE);
cl_mem buf = clCreateBuffer(context, CL_MEM_READ_WRITE, length, NULL, NULL);
clEnqueueReadBuffer(command_queue, buf, CL_TRUE, 0, length, dst, 0, NULL, NULL);
The problem is present only on the Intel GPU; the same code produces correct results when executed on AMD and Nvidia GPUs. The same incorrect behavior can be replicated on correct implementations by calling:
clEnqueueReadBuffer(command_queue, buf, CL_TRUE, 0, length & 0xFFFFF000, dst, 0, NULL, NULL);
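The arithmetic behind the report can be checked without a GPU. The sketch below, in plain C, computes how many bytes a transfer rounded down to a 4 KB multiple would cover and how many trailing bytes it would miss. `PAGE_BYTES`, `truncated_length`, and `missing_tail` are illustrative names, not part of any OpenCL API.

```c
#include <stddef.h>

/* Page size assumed from the report: 4 KB. */
#define PAGE_BYTES ((size_t)4096)

/* Length rounded down to a multiple of 4 KB. For lengths below 4 GB this
   matches the mask `length & 0xFFFFF000` used in the reproduction above. */
static size_t truncated_length(size_t length)
{
    return length & ~(PAGE_BYTES - 1);
}

/* Number of trailing bytes that a truncated transfer would leave unwritten. */
static size_t missing_tail(size_t length)
{
    return length - truncated_length(length);
}
```

For the 76800-byte transfer in the sample, `truncated_length(76800)` is 73728 (18 pages) and `missing_tail(76800)` is 3072. Filling `dst` with a sentinel value before the read and inspecting the last 3072 bytes afterwards is one way to confirm the symptom.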
Hello.
I have some experience with NVIDIA CUDA and some low-level optimizations for their GPUs. Recently I had to port that work to OpenCL so that the code we write runs on more hardware platforms. I've checked the Intel white paper on the Intel integrated GPUs, but so far I haven't managed to find what exactly (if anything) is the difference between the NVIDIA SIMT model and the SIMD model implemented in Intel's integrated GPUs. Furthermore, I've failed to find any video lectures on the Intel GPU architecture (in contrast, for NVIDIA there is the Stanford course on iTunes U, which is quite helpful).
Any help regarding those will be greatly appreciated.
Thanks.
Hi,
I am trying OpenCL 2.0 atomics on HD5500, following https://software.intel.com/en-us/articles/using-opencl-20-atomics.
I use CL_DRIVER_VERSION: 10.18.14.4029.
However, the result of the atomic operations is not what I expect. A simplified version of the test is:
kernel void atomics_test(global int *output, volatile global atomic_int* atomicBuffer, uint iterations, uint offset)
{
    for (int j = 0; j < MY_INNER_LOOP; j++)
        atomic_fetch_add_explicit(&atomicBuffer[0], MY_ADD_VALUE, memory_order_relaxed, memory_scope_device);
}
I only run the kernel on GPU with 1 thread.
Before running the kernel, atomicBuffer[0] is initialized to 1.
Result:
MY_ADD_VALUE=1, MY_INNER_LOOP=1-->atomicBuffer[0]=7 (it seems to be 1+1*6)
MY_ADD_VALUE=1, MY_INNER_LOOP=2-->atomicBuffer[0]=13 (it seems to be 7+1*6)
MY_ADD_VALUE=1, MY_INNER_LOOP=3-->atomicBuffer[0]=19 (it seems to be 13+1*6)
MY_ADD_VALUE=2, MY_INNER_LOOP=1-->atomicBuffer[0]=13 (it seems to be 1+2*6)
MY_ADD_VALUE=2, MY_INNER_LOOP=2-->atomicBuffer[0]=25 (it seems to be 13+2*6)
It seems that atomic_fetch_add does (atom_variable+MY_ADD_VALUE*6), not (atom_variable+MY_ADD_VALUE).
Is it a known issue? Or is something wrong in my test?
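The five results listed above fit a single formula: the final value equals the initial value plus `MY_ADD_VALUE * MY_INNER_LOOP * 6`. The plain C sketch below encodes that reading next to the value a single work-item should produce. `expected_value`, `modeled_value`, and `OBSERVED_FACTOR` are illustrative names, and the model only restates the measurements without explaining their cause.

```c
/* Factor inferred from the reported numbers; an observation, not a documented value. */
#define OBSERVED_FACTOR 6

/* Value one work-item should leave in the counter after `loops` atomic adds. */
static int expected_value(int initial, int add, int loops)
{
    return initial + add * loops;
}

/* Value that matches the results reported in the post. */
static int modeled_value(int initial, int add, int loops)
{
    return initial + add * loops * OBSERVED_FACTOR;
}
```

With an initial value of 1, `modeled_value(1, 2, 2)` gives 25 as reported, while `expected_value(1, 2, 2)` gives 5.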
I'm curious if there are any circumstances that will result in an implicit increase in a kernel workgroup's shared memory requirements?
For example, do the workgroup (or subgroup) functions like scan or reduce quietly "reserve" SLM?
If there are any circumstances where this might happen on SB, IVB, HSW or BDW then could you list them?
The new OpenCL™ Code Analyzer, a feature of Intel® INDE OpenCL™ Code Builder, adds performance analysis capabilities integrated into your Microsoft Visual Studio* OpenCL development environment.
With this new feature, previously in preview, the OpenCL Code Builder now supports each stage of OpenCL code development and lets you carry out performance optimizations at every step, from build to debug to tuning, to get the best out of Intel® Graphics Compute capabilities.
OpenCL™ Code Builder in conjunction with Intel® INDE Platform Analyzer now enables tuning features for OpenCL applications like:
Need more information on the OpenCL Code Builder feature? Check this 2 min video: http://bcove.me/xrcs5bze
I am working on a Decode-OpenCL-Encode pipeline on an Intel processor. There is sample code provided by Intel for media interop, which is attached.
I am integrating the encoder into the same sample.
If we look at the DecodeOneFrame() function below:
mfxStatus CDecodingPipeline::DecodeOneFrame(int Width, int Height, IDirect3DSurface9 *pDstSurface, IDirect3DDevice9* pd3dDevice)
{
mfxU16 nOCLSurfIndex=0;
mfxStatus stsOut = MFX_ERR_NONE;
if(m_Tasks[m_TaskIndex].m_DecodeSync || m_Tasks[m_TaskIndex].m_OCLSync || m_Tasks[m_TaskIndex].m_EncodeSync)
{// wait task is finished and copy result texture to back buffer
mfxStatus sts = MFX_ERR_NONE;
mfxFrameSurface1_OCL* pOutSurface = NULL; // output surface.
//wait the previous submitted tasks
if(m_Tasks[m_TaskIndex].m_DecodeSync)
{
sts = m_mfxSession.SyncOperation(m_Tasks[m_TaskIndex].m_DecodeSync, MSDK_DEC_WAIT_INTERVAL);
MSDK_CHECK_RESULT(sts, MFX_ERR_NONE, sts);
pOutSurface = m_Tasks[m_TaskIndex].m_pDecodeOutSurface;
}
if(m_Tasks[m_TaskIndex].m_OCLSync)
{
sts = m_mfxSession.SyncOperation(m_Tasks[m_TaskIndex].m_OCLSync, MSDK_VPP_WAIT_INTERVAL);
MSDK_CHECK_RESULT(sts, MFX_ERR_NONE, sts);
pOutSurface = m_Tasks[m_TaskIndex].m_pOCLOutSurface;
}
#ifdef ENCODER
if(m_Tasks[m_TaskIndex].m_EncodeSync)
{
sts = m_mfxSession.SyncOperation(m_Tasks[m_TaskIndex].m_EncodeSync, MSDK_ENC_WAIT_INTERVAL);
MSDK_CHECK_RESULT(sts, MFX_ERR_NONE, sts);
//pOutSurface = m_Tasks[m_TaskIndex].m_pEncodeOutSurface;
}
#endif
if(pOutSurface)
{/* copy YUV texture to screen */
HRESULT hr;
IDirect3DSurface9* pSrcSurface = (IDirect3DSurface9*)(pOutSurface->Data.MemId);
assert(pDstSurface && pSrcSurface);
if(pSrcSurface && pDstSurface)
{
RECT r;
r.left = 0;
r.top = 0;
r.right = min(Width,m_mfxDecodeVideoParams.vpp.In.Width);
r.bottom = min(Height,m_mfxDecodeVideoParams.vpp.In.Height);
r.right -= r.right&1;
r.bottom -= r.bottom&1;
V(pd3dDevice->StretchRect(pSrcSurface, &r, pDstSurface, &r,D3DTEXF_POINT));
}
}
#ifdef UNLOCK
if(m_Tasks[m_TaskIndex].m_pDecodeOutSurface && m_Tasks[m_TaskIndex].m_pDecodeOutSurface->Data.Locked)
_InterlockedDecrement16((short*)&m_Tasks[m_TaskIndex].m_pDecodeOutSurface->Data.Locked);
if(m_Tasks[m_TaskIndex].m_pOCLOutSurface && m_Tasks[m_TaskIndex].m_pOCLOutSurface->Data.Locked)
_InterlockedDecrement16((short*)&m_Tasks[m_TaskIndex].m_pOCLOutSurface->Data.Locked);
#ifdef ENCODER
if(m_Tasks[m_TaskIndex].m_pEncodeOutSurface && m_Tasks[m_TaskIndex].m_pEncodeOutSurface->Data.Locked)
_InterlockedDecrement16((short*)&m_Tasks[m_TaskIndex].m_pEncodeOutSurface->Data.Locked);
#endif
#endif
}
#if 1
// clear sync task for further using
m_Tasks[m_TaskIndex].m_OCLSync = 0;
m_Tasks[m_TaskIndex].m_pOCLOutSurface = 0;
m_Tasks[m_TaskIndex].m_DecodeSync = 0;
m_Tasks[m_TaskIndex].m_pDecodeOutSurface = 0;
#ifdef ENCODER
m_Tasks[m_TaskIndex].m_EncodeSync = 0;
m_Tasks[m_TaskIndex].m_pEncodeOutSurface = 0;
#endif
#endif
if(m_DECODEFlag)
{// feed decoder
mfxSyncPoint DecodeSyncPoint = 0;
static mfxU16 nDecoderSurfIndex = 0; // index of free surface
mfxStatus sts = MFX_ERR_NONE;
m_pmfxDecodeSurfaceLast = NULL; // reset current decoder surface to get new one from Decoder
while(MFX_ERR_NONE <= sts || MFX_ERR_MORE_DATA == sts || MFX_ERR_MORE_SURFACE == sts || MFX_WRN_DEVICE_BUSY == sts)
{// loop until decoder report that it get request for new frame
if (MFX_WRN_DEVICE_BUSY == sts)
{
Sleep(1); // just wait and then repeat the same call to DecodeFrameAsync
}
else if (MFX_ERR_MORE_DATA == sts)
{ // read more data to input bit stream
sts = m_FileReader.ReadNextFrame(&m_mfxBS);
MSDK_BREAK_ON_ERROR(sts);
}
else if (MFX_ERR_MORE_SURFACE == sts || MFX_ERR_NONE == sts)
{// find new working-output surface in m_pmfxDecodeSurfaces
//nDecoderSurfIndex = 0;
nDecoderSurfIndex = GetFreeSurfaceIndex(m_pmfxDecodeSurfaces, m_mfxDecoderResponse.NumFrameActual,nDecoderSurfIndex);
if (MSDK_INVALID_SURF_IDX == nDecoderSurfIndex)
{
return MFX_ERR_MEMORY_ALLOC;
}
}
// send request to decoder
sts = m_pmfxDEC->DecodeFrameAsync(
&m_mfxBS,
(mfxFrameSurface1*)&(m_pmfxDecodeSurfaces[nDecoderSurfIndex]),
(mfxFrameSurface1**)&m_pmfxDecodeSurfaceLast,
&DecodeSyncPoint);
// ignore warnings if output is available,
// if no output and no action required just repeat the same call
if (MFX_ERR_NONE < sts && DecodeSyncPoint)
{
sts = MFX_ERR_NONE;
}
if (MFX_ERR_NONE == sts)
{// decoder returned a sync point, so fill the current task and switch to OCL Plugin feeding
m_Tasks[m_TaskIndex].m_DecodeSync = DecodeSyncPoint;
m_Tasks[m_TaskIndex].m_pDecodeOutSurface = m_pmfxDecodeSurfaceLast;
// look for output process
#ifdef UNLOCK
if(m_Tasks[m_TaskIndex].m_pDecodeOutSurface)
_InterlockedIncrement16((short*)&m_Tasks[m_TaskIndex].m_pDecodeOutSurface->Data.Locked);
#endif
break;
}
}
if(MFX_ERR_NONE != sts)
{
printf("ERROR: Decoder returns error %d!\n",sts);
stsOut = sts;
}
//decoder sync point
sts = m_mfxSession.SyncOperation(m_Tasks[m_TaskIndex].m_DecodeSync, MSDK_DEC_WAIT_INTERVAL);
MSDK_CHECK_RESULT(sts, MFX_ERR_NONE, sts);
}//if(m_DECODEFlag)
if(m_pOCLPlugin && m_pOCLPlugin->m_OCLFlag)
{// OPENCL part
mfxU16 nOCLSurfIndex=0;
mfxSyncPoint OCLSyncPoint = 0;
mfxStatus sts = MFX_ERR_NONE;
// get index for output surface for OCL plugin
nOCLSurfIndex = GetFreeSurfaceIndex(m_pmfxOCLSurfaces, m_mfxOCLResponse.NumFrameActual);
MSDK_CHECK_ERROR(nOCLSurfIndex, MSDK_INVALID_SURF_IDX, MFX_ERR_MEMORY_ALLOC);
//mfxHDL pOutSurf = &m_pmfxOCLSurfaces[nOCLSurfIndex];
//mfxHDL pOutSurf = &m_pmfxEncSurfaces[nEncSurfIdx];
//m_pmfxOCLSurfaces[nOCLSurfIndex] = m_pmfxEncSurfaces[nEncSurfIdx];
mfxHDL pOutSurf = &m_pmfxOCLSurfaces[nOCLSurfIndex];
mfxHDL inp = m_pmfxDecodeSurfaceLast;
// OCL filter
for(;;)
{
sts = MFXVideoUSER_ProcessFrameAsync(m_mfxSession, &inp, 1, &pOutSurf, 1, &OCLSyncPoint);
if (MFX_WRN_DEVICE_BUSY == sts)
{
Sleep(1); // just wait and then repeat the same call
}
else
{
break;
}
}
// ignore warnings if output is available,
if (MFX_ERR_NONE < sts && OCLSyncPoint)
{
sts = MFX_ERR_NONE;
}
if(MFX_ERR_NONE!=sts)
{
printf("ERROR: OpenCL filter return error %d!\n",sts);
return sts;
}
{
m_Tasks[m_TaskIndex].m_OCLSync = OCLSyncPoint;
m_Tasks[m_TaskIndex].m_pOCLOutSurface = &m_pmfxOCLSurfaces[nOCLSurfIndex];
//m_Tasks[m_TaskIndex].m_pOCLOutSurface = &m_pmfxEncSurfaces[nEncSurfIdx];
#ifdef UNLOCK
// look for output process
_InterlockedIncrement16((short*)&m_Tasks[m_TaskIndex].m_pOCLOutSurface->Data.Locked);
#endif
}
}
#ifdef ENCODER
if(m_ENCODEFlag)
{// feed encoder
static mfxU16 nEncSurfIdx = 0; // index of free surface
mfxSyncPoint EncSyncP;
mfxStatus sts = MFX_ERR_NONE;
//mfxFrameSurface1* pSurf = NULL; // dispatching pointer
// find free surface for encoder input
nEncSurfIdx = GetFreeSurface(m_pmfxEncSurfaces, m_mfxEncResponse.NumFrameActual);
MSDK_CHECK_ERROR(nEncSurfIdx, MSDK_INVALID_SURF_IDX, MFX_ERR_MEMORY_ALLOC);
// point pSurf to encoder surface
//m_pmfxEncSurfaces[nEncSurfIdx] = m_pmfxOCLSurfaces[nOCLSurfIndex];
for (;;)
{
// at this point surface for encoder contains either a frame from file or a frame processed by vpp
sts = m_pmfxENC->EncodeFrameAsync(NULL, &m_pmfxEncSurfaces[nEncSurfIdx], &m_mfxEncBS, &EncSyncP);
if (MFX_ERR_NONE < sts && !EncSyncP) // repeat the call if warning and no output
{
if (MFX_WRN_DEVICE_BUSY == sts)
MSDK_SLEEP(1); // wait if device is busy
}
else if (MFX_ERR_NONE < sts && EncSyncP)
{
sts = MFX_ERR_NONE; // ignore warnings if output is available
break;
}
else if (MFX_ERR_NOT_ENOUGH_BUFFER == sts)
{
sts = AllocateSufficientBuffer(&m_mfxEncBS);
MSDK_CHECK_RESULT(sts, MFX_ERR_NONE, sts);
printf("\n BUFFER allocated");
}
else
{
// get next surface and new task for 2nd bitstream in ViewOutput mode
MSDK_IGNORE_MFX_STS(sts, MFX_ERR_MORE_BITSTREAM);
break;
}
}
if (MFX_ERR_MORE_DATA == sts) {
sts = MFX_ERR_NONE;
}
if (MFX_ERR_NONE == sts)
{
m_Tasks[m_TaskIndex].m_EncodeSync = EncSyncP;
}
#ifdef UNLOCK
if (MFX_ERR_NONE == sts)
{// encoder returned a sync point, so fill the current task and switch to encoder feeding
m_Tasks[m_TaskIndex].m_pEncodeOutSurface = &m_pmfxEncSurfaces[nEncSurfIdx];
_InterlockedIncrement16((short*)&m_Tasks[m_TaskIndex].m_pEncodeOutSurface->Data.Locked);
}
#endif
}
#endif
// increase task index to point to next task.
m_TaskIndex = (m_TaskIndex+1)%SYNC_BUF_SIZE;
return stsOut;
}//CDecodingPipeline::DecodeOneFrame
If I use this code the encoder output is corrupted. When I decode the encoder output, it seems the frames are not displayed in the proper order.
I think I am not giving the right surface to the encoder, as the encoder surface index is calculated independently.
But when I give the OCL output surface to the encoder, my OCL stage also stops working.
Can anyone guide me here?
Is there an OpenCL 1.2 CPU driver for an "Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz" that supports khr_gl_sharing with the HD2000 GPU?
If I install the OpenCL™ Runtime 15.1 I get an OpenCL 1.2 CPU driver, but it has no khr_gl_sharing.
The OpenCL 1.1 CPU driver that comes with the graphics driver does support khr_gl_sharing.
Does Intel support OpenCL CPU khr_gl_sharing with older Intel GMA graphics chipsets (for OpenCL 1.1? OpenCL 1.2?)
What about non-Intel graphics chipsets (Nvidia, AMD, etc.)?
Is there a table of what is supported on what?
On my Haswell machine the 4.2.0.148(?) OpenCL driver is OpenCL 1.2 and supports cl_khr_gl_sharing.
If I install the OpenCL™ Runtime 15.1 on it I get a 5.0.? CPU driver without khr_gl_sharing.
Anyway, my application needs khr_gl_sharing and prefers OpenCL 1.2. I need to be able to tell people what works and what doesn't.
Hi all,
I've installed Visual Studio 2012 Express edition under Windows 7 64-bit and installed Intel INDE Starter edition with code_builder_5.1.0.25.
Next I opened a Visual Studio x64 command prompt, which initially points to "C:\Program Files (x86)\Microsoft Visual Studio 11.0\VC", changed directory to "C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC" and ran VCVARSALL.BAT. Without this it would not find <iostream> etc.
I then try to compile my OpenCL program, which is based on cl.hpp, using the command line:
cl -o myprog.exe -O2 myprog.cpp -I"C:\Intel\INDE\code_builder_5.1.0.25\include""C:\Intel\INDE\code_builder_5.1.0.25\lib\x64\OpenCL.lib"
There are lots of errors of the kind:
myprog.obj : error LNK2019: unresolved external symbol _clGetPlatformIDs@12 referenced in function "public: static int __cdecl cl::Platform::get(class std::vector<class cl::Platform, class std::allocator<class cl::Platform> > *)" (?get@Platform@cl@@SAHPAV?$vector@VPlatform@cl$allocator@VPlatform@cstd@@@std@@@|)
and another 23 of this kind.
Can you please help?
Many thanks!
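The `_clGetPlatformIDs@12` form in the error is the name decoration used for `__stdcall` functions in 32-bit builds, so one thing worth checking is which target the compiler environment actually selected after running VCVARSALL.BAT. The sketch below is plain C and reports the pointer width of the build; `target_bits` is an illustrative name.

```c
#include <limits.h>

/* Pointer width, in bits, of the target this file was compiled for:
   32 for an x86 build, 64 for an x64 build. */
static int target_bits(void)
{
    return (int)(sizeof(void *) * CHAR_BIT);
}
```

Printing `target_bits()` from a program built in the same command prompt shows whether the build matches the `lib\x64` import library named on the command line.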
Hi,
I want to install Intel INDE 2015 Update 2 on a computer running Windows 8.1 that already has Intel SDK for OpenCL Applications 2014 installed.
I tried to uninstall the SDK from the Windows Control Panel, but the uninstaller got stuck. After waiting more than 10 hours I turned off the computer (the Cancel button did not help).
I then started the installation of Intel INDE 2015 Update 2, but it got stuck in the phase that removes Intel SDK for OpenCL Applications 2014. The same behavior was observed on 2 different computers.
A screenshot of the installer user interface is attached.
What is the easiest way to manually uninstall OpenCL 2014 SDK?
Please advise,
Best regards,
Micha
Will this poster be made available online:
Accelerating SGEMM with Subgroups
The concept of a subgroup was introduced in the OpenCL 2.0 spec and is an optional Khronos OpenCL extension. This poster will describe work done at Intel to accelerate the SGEMM matrix multiplication algorithm on Intel GPUs using subgroups. Using subgroups, we were able to achieve SGEMM performance results that were comparable to our best hand-written assembler results.
?
Hi,
I use OpenCL on Intel® Core™ i7-4980HQ Processor.
The problem is:
When the CPU and GPU write to the same cache line, the results are sometimes wrong.
The CPU and GPU both run OpenCL kernels.
Is there any mechanism to solve this? Atomic operations or something else?
Looking forward to your reply,
Thanks!
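One common way to avoid two devices writing the same cache line is to split the shared buffer on cache-line boundaries so that each writer owns whole lines. The sketch below is plain C and assumes a 64-byte line, which should be verified for the target hardware. `CACHE_LINE_BYTES` and `round_up_to_line` are illustrative names, and whether this layout fits the application depends on its access pattern.

```c
#include <stddef.h>

/* Assumed cache-line size in bytes; verify for the actual device. */
#define CACHE_LINE_BYTES ((size_t)64)

/* Rounds a byte offset up to the next cache-line boundary. */
static size_t round_up_to_line(size_t offset)
{
    return (offset + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES * CACHE_LINE_BYTES;
}
```

If the CPU kernel writes bytes `[0, n)`, starting the GPU kernel's region at `round_up_to_line(n)` keeps the two writers on different lines, provided the buffer base is itself 64-byte aligned.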
Hi all,
I intend to compile and run a simple OpenCL application on Ubuntu 15.04 (x86_64). The application is written in C. I compiled it with GCC on a PC that has an Intel dual-core CPU and Intel graphics, but there is a problem with #include "CL/cl.h". Could you tell me which packages should be installed on Ubuntu and where they can be downloaded from?
Regards,
Hi,
We would like to provide OpenCL support for Intel Core and Xeon processors and Intel Xeon Phi coprocessors on our cluster. In the online documentation I read that "For Intel® Xeon Phi™ coprocessor support, you must install the OpenCL runtime 14.2 here, and the Intel® Manycore Platform Software Stack (Intel® MPSS) 3.3 here".
Our cluster has around 380 nodes, each node equipped with 2 octa-core Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz and 2 16GB Intel Xeon-Phi 7120P Accelerator. The OS is CentOS Linux release 7.0.1406, and the MPSS is 3.5-1 version.
I read in a very old post (https://software.intel.com/en-us/forums/topic/536938) that "I (Yuri Kulakov,Intel) may just confirm information regarding approved MPSS version. Intel OpenCL™ Runtime 14.2 was validated with MPSS 3.3 (as per release notes)". Apologies if I missed additional posts on the matter. Since that post dates back to last year, I was wondering if it still applies, in particular:
1) does it mean that the OpenCL runtime won't work with our MPSS 3.5?
or
2) does it mean that we cannot know if it will work - hence we can try to install the Runtime and the Code Builder?
Thank you very much in advance,
Isabella
I am the developer of VexCL, an open-source C++ library for OpenCL (https://github.com/ddemidov/vexcl). I'd like to use Intel OpenCL Code Builder as part of a continuous integration process at AppVeyor.com. In order to do this, the Code Builder has to be installed from a build script. It is possible to install just the Code Builder from the online installer of Intel® INDE (https://software.intel.com/en-us/articles/getting-started-with-opencl-de...). The installer at some point says it is downloading code_builder_x64_setup.msi. Would it be possible to get a direct link to the msi package?
I'm working with Code Builder. I've been running happily for six months with the 2014 OpenCL Applications SDK under VS2010. I thought I knew all the tricks to ensure kernel breakpoints were hit, but one (large) new project consistently fails to hit breakpoints. I figured I can't go to the forums with an old version, so I tried installing INDE with the restricted options for just a Code Builder installation, but that just stripped out my old VS2010 Code Builder extensions. So I took a deep breath and installed the whole INDE, with the same results (later I noticed that INDE did install Code Builder extensions for VS2013: does it not work for 2010 anymore?).
Anyway, sadly all in vain: still no breakpoints getting hit. Code Builder has "Enable OpenCL kernel debugging" set, and "Enable OpenCL API debugger" is also set, for what it's worth. Work items are set for 0,0,0. Debugging is on the local machine. Kernels were built with the -g -s <source> option (no other options), and the source is certainly the right one. No offsets. Device is my local i7-4770. Windows 7 Enterprise SP1 64 bit. Latest Iris and HD graphics driver (15.36.19.64.4170). I haven't checked the OpenCL runtime, but presume that was updated with INDE?
I have a one thread dummy kernel that just prints its local thread:
__kernel void debugTest() { printf("%d\n",get_local_id(0)); }
The kernel dutifully runs and prints. It just ignores the breakpoint and completes normally.
Am I missing anything? Getting stressed.
I was trying to install Intel OpenCL Runtime 15.1 for Ubuntu. I have a Xeon E3 Haswell.
Now, release notes clearly state 14.04 *is supported*: https://software.intel.com/file/448151/download , and installation instructions are pretty clear.
BUT, when I try to download the runtime from the official download page
https://software.intel.com/en-us/articles/opencl-drivers#ubuntu64
it downloads Intel Code Builder for Ubuntu, which supports 12.04 only.
How can I get runtime 15.1 installed on my Ubuntu 14.04?
I can find the cl_khr_fp64 extension, required for double precision, on a Ubuntu 14.04 running on Amazon EC2 using Intel Xeon and Intel Code Builder for OpenCL API 2015 for Ubuntu:
---------------------------------------------------------------------------
OS name: Linux
Release:3.13.0-52-generic
Version:#86-Ubuntu SMP Mon May 4 04:32:59 UTC 2015
Machine:x86_64
Platform name = Intel(R) OpenCL
Platform version = OpenCL 1.2 LINUX
Platform extensions = cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_3d_image_writes cl_intel_exec_by_local_thread cl_khr_spir cl_khr_fp64
Device name = Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Device version = OpenCL 1.2 (Build 43)
Device global memory size= 1040732160
Device available? Yes
Device extensions= cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_3d_image_writes cl_intel_exec_by_local_thread cl_khr_spir cl_khr_fp64
Following double precision features supported:
INF and NaN values
Denormalized numbers
Round To Nearest Even mode
Round To Infinity mode
Round To Zero mode
Floating-point multiply-and-add operation
----------------------------------------------------------------------------
I can also see that extension for Core i7-2620M:
But I don't see that extension on my home system, which runs Ubuntu 14.04 as a VirtualBox machine within Windows 7 on a Core i7-2620M, with the same Intel Code Builder for OpenCL API 2015 for Ubuntu as on Amazon EC2:
-----------------------------------------------------------------------------------------------------------------------
OS name: Linux
Release:3.13.0-53-generic
Version:#89-Ubuntu SMP Wed May 20 10:34:39 UTC 2015
Machine:x86_64
Platform name = Intel(R) OpenCL
Platform version = OpenCL 1.2 LINUX
Platform extensions = cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_3d_image_writes cl_intel_exec_by_local_thread cl_khr_spir
Device name = Intel(R) Core(TM) i7-2620M CPU @ 2.70GHz
Device version = OpenCL 1.2 (Build 43)
Device global memory size= 3156189184
Device available? Yes
Device extensions= cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_3d_image_writes cl_intel_exec_by_local_thread cl_khr_spir
Double precision not supported
------------------------------------------------------------------------
How can I get that extension for my home system above?
Thanks in advance!
PS: Checked for double precision using...
cl_device_fp_config flag;
ret = clGetDeviceInfo(device_id, CL_DEVICE_DOUBLE_FP_CONFIG, sizeof(flag), &flag, NULL);
if (!flag)
    printf("Double precision not supported \n\n");
else
    printf("Following double precision features supported:\n");
if(flag & CL_FP_INF_NAN) printf(" INF and NaN values\n");
if(flag & CL_FP_DENORM) printf(" Denormalized numbers\n");
if(flag & CL_FP_ROUND_TO_NEAREST) printf(" Round To Nearest Even mode\n");
if(flag & CL_FP_ROUND_TO_INF) printf(" Round To Infinity mode\n");
if(flag & CL_FP_ROUND_TO_ZERO) printf(" Round To Zero mode\n");
if(flag & CL_FP_FMA) printf(" Floating-point multiply-and-add operation\n\n");
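Another way to run the same check is to look for the `cl_khr_fp64` token in the device extension string, as printed in the two listings above. The sketch below is plain C string handling with no OpenCL calls. `has_extension` is a hypothetical helper, and in a real program the string would come from `clGetDeviceInfo` with `CL_DEVICE_EXTENSIONS`.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if `name` appears as a whole space-separated token in
   `extensions`, and 0 otherwise. Substring hits inside longer names
   are rejected. */
static int has_extension(const char *extensions, const char *name)
{
    size_t name_len = strlen(name);
    const char *p = extensions;

    if (name_len == 0)
        return 0;

    while ((p = strstr(p, name)) != NULL) {
        int starts_token = (p == extensions) || (p[-1] == ' ');
        int ends_token = (p[name_len] == '\0') || (p[name_len] == ' ');
        if (starts_token && ends_token)
            return 1;
        p += 1; /* keep searching after this partial hit */
    }
    return 0;
}
```

Applied to the extension strings from the post, the EC2 listing (which ends in `cl_khr_fp64`) yields 1 and the VirtualBox listing (which ends in `cl_khr_spir`) yields 0.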