Consecutive (and repeated) Kernels

This is more appropriate for single-device scenarios. The developer can list multiple kernels by name, separated by spaces, commas, semicolons, minus signs or new-line characters. All listed kernels are executed one after another in the same command queue, and the whole chain is treated as a single operation for profiling and load balancing.

f.compute(gpu, 1, "sortParticles findNeighbors,calculateForces;moveParticles", 1024); 
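
As a fuller (hypothetical) sketch of the same chained call with its setup, where particleKernelSource stands for an OpenCL source string defining all four kernels and the array size is arbitrary:

ClArray<float> particles = new ClArray<float>(1024 * 4); // hypothetical particle buffer
ClNumberCruncher gpu = new ClNumberCruncher(ClPlatforms.all().gpus()[0], particleKernelSource);
// all four kernels run back-to-back in the same command queue and are
// profiled / load-balanced as one operation
particles.compute(gpu, 1, "sortParticles findNeighbors,calculateForces;moveParticles", 1024);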

For multiple-device, multiple-kernel execution there have to be multiple compute method calls, because arrays are synchronized only in RAM (C# arrays or C++ arrays), not between devices. The results of all devices are joined in RAM, then read back by all devices on the next compute call.
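
A minimal sketch of that pattern, assuming the ClNumberCruncher constructor also accepts the whole device list returned by ClPlatforms.all().gpus() (the examples below index a single device) and using placeholder kernel names:

ClNumberCruncher gpus = new ClNumberCruncher(ClPlatforms.all().gpus(), particleKernelSource);
ClArray<float> particles = new ClArray<float>(1024 * 4);
// first call: the work is shared by all devices, partial results are joined in RAM
particles.compute(gpus, 1, "findNeighbors", 1024);
// second call: every device re-reads the joined array from RAM before computing
particles.compute(gpus, 2, "calculateForces", 1024);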


v1.2.6 adds a "kernel repeat" feature that decreases total latency when the developer needs kernel(s) repeated many times. It converts the workflow of

READ-DATA ---> COMPUTE ---> WRITE-RESULTS

to

READ-DATA ---> COMPUTE COMPUTE COMPUTE .... (N times total) COMPUTE ---> WRITE-RESULTS

so once data is uploaded, it can be computed repeatedly before the results are read back to the host side.

To repeat kernels, the repeat count value needs to be specified before the compute method is called:

numberCruncher.repeatCount=N;

this value is 1 by default and cannot be zero or negative (if such a value is set, it is corrected to 1).

To repeat kernels with a parameter-changer kernel (for example, one that only resets a single counter or a trivial variable) at the end of each repeat iteration, a kernel name needs to be specified on the number cruncher object:

numberCruncher.repeatKernelName="reinitializeCounters";

this effectively means

READ-DATA ---> kernel reinitializeCounters kernel reinitializeCounters .... kernel reinitializeCounters ---> WRITE-RESULTS

where "kernel" can be just a single kernel or a list of kernels separated by delimiter characters such as space and comma. The repeatKernelName kernel is executed with global range = local range parameter to have minimum latency cost on overall compute overhead. This means only single work group is executed because this is intended to re-initialize just a few values between every iteration of N-repeats.

The repeating option doesn't synchronize between multiple devices. It repeats in-device only.

When the kernel repeat number is on the order of thousands, it saves nearly 50% of the compute time for light workloads. Here is a comparison between the for-loop version and the "repeat"-featured version:

Old way:

ClArray<byte> data0 = new ClArray<byte>(1024);
ClNumberCruncher cr = new ClNumberCruncher((ClPlatforms.all().gpus()[0]), @"
    __kernel void vecAdd(__global unsigned char * data)
    {
        int id=get_global_id(0);
        data[id]++; 
    }

    __kernel void vecAddNextStep(__global unsigned char * data)
    {
       int id=get_global_id(0);
       data[id]--;
    }
");
if (cr.errorCode() != 0)
   Console.WriteLine(cr.errorMessage());
data0.read = false;
data0.write = false;
benchStart();
cr.repeatCount = 1; // default: no repeating
cr.repeatKernelName = ""; // default: no extra kernel between iterations
for (int i = 0; i < 10000; i++)
   data0.compute(cr, 1, "vecAdd", 1024);
cr.repeatCount = 1;
benchStop();

Result: Elapsed time=1902 ms

New way:

ClArray<byte> data0 = new ClArray<byte>(1024);
ClNumberCruncher cr = new ClNumberCruncher((ClPlatforms.all().gpus()[0]), @"
    __kernel void vecAdd(__global unsigned char * data)
    {
        int id=get_global_id(0);
        data[id]++; 
    }

    __kernel void vecAddNextStep(__global unsigned char * data)
    {
       int id=get_global_id(0);
       data[id]--;
    }
");
if (cr.errorCode() != 0)
   Console.WriteLine(cr.errorMessage());
data0.read = false;
data0.write = false;
benchStart();
cr.repeatCount = 10000; // repeat in-device instead of looping on the host
cr.repeatKernelName = ""; // default: no extra kernel between iterations

data0.compute(cr, 1, "vecAdd", 1024);

cr.repeatCount = 1;
benchStop();

so the only change is moving the for-loop iteration count into the repeatCount field. Result: Elapsed time=36 ms

This is more than 50 times faster. Reducing unnecessary host-device synchronizations helps noticeably when the kernel workload is light, as in this example. Since the iteration-initializer kernel runs with only a single work group, it adds minimal latency:

ClArray<byte> data0 = new ClArray<byte>(1024);
ClNumberCruncher cr = new ClNumberCruncher((ClPlatforms.all().gpus()[0]), @"
    __kernel void vecAdd(__global unsigned char * data)
    {
        int id=get_global_id(0);
        data[id]++; 
    }

    __kernel void vecAddNextStep(__global unsigned char * data)
    {
       int id=get_global_id(0);
       data[id]--;
    }
");
if (cr.errorCode() != 0)
   Console.WriteLine(cr.errorMessage());
data0.read = false;
data0.write = false;
benchStart();
cr.repeatCount = 10000; 
cr.repeatKernelName = "vecAddNextStep"; // also runs 10000 times, but with only 256 work items (one work group, the default local range)

data0.compute(cr, 1, "vecAdd", 1024);

cr.repeatCount = 1;
benchStop();

Result: Elapsed time=72 ms

so for the device side, computing a 1024-thread kernel that only increments a byte per thread is not very different from a 256-thread kernel. The timing difference (1024 threads vs 256 threads) is smaller than the C# stopwatch measurement error, probably at the nanoseconds level. To see the actual time cost of the "reinitializer kernel" relative to "vecAdd", the work size can be increased:

  • Workitems: 1024*1024, repeat count: 1000, result: Elapsed time=153 ms
  • Workitems: 1024*1024, repeat count: 1000, repeat kernel enabled, result: Elapsed time=170 ms
  • Workitems: 4*1024*1024, repeat count: 1000, result: Elapsed time=303 ms
  • Workitems: 4*1024*1024, repeat count: 1000, repeat kernel enabled, result: Elapsed time=294 ms
  • For loop with 1000 iterations containing a 4M-threaded-vecAdd kernel compute() and a 256-threaded-vecAddNextStep kernel compute(): Elapsed time=677 ms

the increased amount of work hides the "iteration transition" kernel's latency behind the "error" of the time measurement; otherwise 2000 kernel executions shouldn't be faster than just 1000 kernel executions. Even with 4M threads, moving from the for loop to the repeat feature cuts about 55% of the latency, because the iteration kernel is embedded into the single operation.
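
For reference, the for-loop comparison in the last bullet was presumably structured like the sketch below (same kernels and cruncher as the examples above; the compute ids are arbitrary, and the 256-item global range gives the single-work-group launch mentioned earlier):

benchStart();
for (int i = 0; i < 1000; i++)
{
   data0.compute(cr, 1, "vecAdd", 4 * 1024 * 1024); // 4M work items per iteration
   data0.compute(cr, 2, "vecAddNextStep", 256);     // one 256-item work group, driven from the host
}
benchStop();

The repeat-feature equivalent simply sets cr.repeatCount = 1000 and cr.repeatKernelName = "vecAddNextStep" before a single 4M-item compute call, as in the third example above.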