Channel: Intel® Software - OpenCL*

optimize kernel for vector addition


I have two float vectors, each with 1024*1024*8 elements, and I want to add them element by element. My first kernel, vec_add_1(), is launched with Gx=1024*1024*8 and Lx=0.

__kernel void vec_add_1(__global const float* in1, __global const float* in2, __global float* out) {
    int i = get_global_id(0);    // one work-item per element
    out[i] = in1[i] + in2[i];
}

vec_add_1() takes about 10 ms.

To reduce scheduling overhead, I created a second kernel, vec_add_2(), in which each work-item adds four consecutive elements. vec_add_2() is launched with Gx=1024*1024*8/4 and Lx=0.

__kernel void vec_add_2(__global const float* in1, __global const float* in2, __global float* out) {
    int i = get_global_id(0);
    int j = (i << 2);            // first of the four elements handled by this work-item

    out[j]   = in1[j]   + in2[j];
    out[j+1] = in1[j+1] + in2[j+1];
    out[j+2] = in1[j+2] + in2[j+2];
    out[j+3] = in1[j+3] + in2[j+3];
}
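
As a sanity check on the index mapping, here is a minimal host-side sketch in plain C (not OpenCL) that emulates both kernels with ordinary loops and compares their output. The names emulate_vec_add_1, emulate_vec_add_2 and mappings_agree are hypothetical helpers made up for this illustration, and the loop counter only stands in for get_global_id(0); it says nothing about performance on the device.

```c
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for vec_add_1(): one emulated work-item per element (Gx = n). */
static void emulate_vec_add_1(const float *in1, const float *in2, float *out, size_t n)
{
    for (size_t gid = 0; gid < n; ++gid) {      /* gid plays the role of get_global_id(0) */
        out[gid] = in1[gid] + in2[gid];
    }
}

/* Stand-in for vec_add_2(): one emulated work-item per four elements (Gx = n / 4). */
static void emulate_vec_add_2(const float *in1, const float *in2, float *out, size_t n)
{
    size_t gx = n / 4;                          /* integer division: a tail of n % 4 elements is skipped */
    for (size_t gid = 0; gid < gx; ++gid) {
        size_t j = gid << 2;                    /* same mapping as j = (i << 2) in the kernel */
        out[j]     = in1[j]     + in2[j];
        out[j + 1] = in1[j + 1] + in2[j + 1];
        out[j + 2] = in1[j + 2] + in2[j + 2];
        out[j + 3] = in1[j + 3] + in2[j + 3];
    }
}

/* Returns 1 if both mappings write identical results for n elements, 0 otherwise. */
static int mappings_agree(size_t n)
{
    float *buf = malloc(4 * n * sizeof *buf);   /* in1, in2, out1, out2 packed in one allocation */
    if (buf == NULL) {
        return 0;
    }
    float *in1 = buf, *in2 = buf + n, *out1 = buf + 2 * n, *out2 = buf + 3 * n;

    for (size_t i = 0; i < n; ++i) {
        in1[i]  = (float)i * 0.5f;
        in2[i]  = 1.0f;
        out1[i] = -1.0f;                        /* sentinel: every real sum here is >= 1.0f */
        out2[i] = -1.0f;
    }

    emulate_vec_add_1(in1, in2, out1, n);
    emulate_vec_add_2(in1, in2, out2, n);

    int same = 1;
    for (size_t i = 0; i < n; ++i) {
        if (out1[i] != out2[i]) {
            same = 0;
        }
    }
    free(buf);
    return same;
}
```

Calling mappings_agree(n) from a small main() returns 1 whenever n is a multiple of 4, which is the case for 1024*1024*8. For other sizes the four-element version leaves the last n % 4 elements unwritten, so Gx would have to be rounded up and the kernel would need a bounds check.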

However, I get two quite different timings for vec_add_2():

- Run in a Code Builder session, vec_add_2() takes ~13 ms, which is slower than vec_add_1().

- Run from my own host code, vec_add_2() takes ~7 ms, which is faster than vec_add_1().
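
To put these timings on a common scale, here is a rough sketch in C that converts a kernel time into effective memory bandwidth. It assumes each element costs two float reads and one float write and that nothing is served from cache, which is my assumption and not something I have measured. The function names vec_add_bytes and effective_gbps are hypothetical.

```c
#include <stddef.h>

/* Bytes moved by one vector addition: two input reads and one output write per element.
   Assumption: every access goes to memory (no cache reuse). */
static double vec_add_bytes(size_t n_elements)
{
    return 3.0 * (double)n_elements * (double)sizeof(float);
}

/* Effective bandwidth in GB/s (1 GB = 1e9 bytes) for a kernel time given in milliseconds. */
static double effective_gbps(size_t n_elements, double kernel_ms)
{
    double seconds = kernel_ms * 1e-3;
    return vec_add_bytes(n_elements) / seconds / 1e9;
}
```

With n = 1024*1024*8, about 100.7 MB are moved per launch, so 10 ms, 13 ms and 7 ms correspond to roughly 10.1, 7.7 and 14.4 GB/s. Comparing these figures with the memory bandwidth of the device would show how much headroom either kernel has left.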

So my questions are:

- Why does vec_add_2() give such different timings when run inside a Code Builder session and when run from my host code?

- Is vec_add_2() really a better-optimized version of vec_add_1()?

Thanks,

Jeffrey

