Skip to content

Pipelining: Device to Device

Hüseyin Tuğrul BÜYÜKIŞIK edited this page May 12, 2017 · 32 revisions

For versions before v1.2.0, only separable kernels had the option to be pipelined concurrently for multiple GPUs. But some projects having multiple non-separable kernels could not get any help from that feature. Generally more hardware-dependent in-kernel commands like "atomic functions" which use any global memory address whenever needed, make the "partitioning" impossible.

To bypass difficulties of partitioning of multiple non-partitionable kernels, version 1.2.0 of Cekirdekler API adds "device to device" pipelining feature. This technique binds a different device for each stage of a multi-stage pipeline and runs all stages concurrently. Concurrency of stages is also expanded to include double buffering that makes inter-stage data movements' latencies hidden behind the compute time of stages (if its greater).

This is how it works:

Creating a pipeline needs the stages to be created first. A pipeline stage is created with kernels, input-output(and hidden for sequential logic) buffers and a device.

Pipeline stage for addition of "1" to array elements:

Hardware.ClDevices gpu1 = Hardware.ClPlatforms.all().devicesWithMostComputeUnits()[0];
Pipeline.ClPipelineStage add = new Pipeline.ClPipelineStage();
add.addDevices( gpu1);
ClArray<float> inputArrayGPU1 = new ClArray<float>(1024);
ClArray<float> outputArrayGPU1 = new ClArray<float>(1024);
add.addInputBuffers(inputArrayGPU1);
add.addOutputBuffers(outputArrayGPU1);
add.addKernels(@"
__kernel void vecAdd(__global float * input, __global float * output)
{ 
    int id=get_global_id(0); 
    output[id]=input[id]+1.0f; 
}", "vecAdd", new int[] { 1024/*global range*/ }, new int[] { 256/* local range*/ }); 

pipeline stage for multiplication of array elements by 5:

Hardware.ClDevices gpu2 = Hardware.ClPlatforms.all().devicesWithMostComputeUnits()[1];
Pipeline.ClPipelineStage mul = new Pipeline.ClPipelineStage();
mul .addDevices( gpu2);
ClArray<float> inputArrayGPU2 = new ClArray<float>(1024);
ClArray<float> outputArrayGPU2 = new ClArray<float>(1024);
mul .addInputBuffers(inputArrayGPU2);
mul .addOutputBuffers(outputArrayGPU2);
mul .addKernels(@"
__kernel void vecMul(__global float * input, __global float * output)
{ 
    int id=get_global_id(0); 
    output[id]=input[id]*5.0f; 
}", "vecMul", new int[] { 1024/*global range*/ }, new int[] { 256/* local range*/ }); 

setting the initializing parameters for a stage:

stage.initializerKernel("kernelName1 kernelName2;kernelName3,kernelName4", new int[] { N,N,N/2,N/4 }, new int[] { 256,256,256,256 });

names of the initializer kernels indicate the functions defined in addKernels() method. N is global range of each kernel, 256 is a trivial example of local range for kernels

building a pipeline from add+mul stages:

add.prependToStage(mul); // prepends add stage to mul stage
var pipeline = add.makePipeline(); // creates a 2-stage pipeline out of add and mul.

makePipeline() method also initializes all stages with the specified initializer kernels and their range values in the initializerKernel() methods of stages.

pushing data to pipeline actually runs it:

pipeline.pushData(new object[] { arrayToGetData }, new object[] { arrayToReceiveResult })

so the pushed data travels further in the pipeline at each pushData() method call until it arrives to target(result) array.

pushData() method returns a boolean value indicating that the result is ready or not. If true, the first input values has been computed on all pipeline stages and already copied to output array(or any result array given by client code). So it can be directly used like this:

if(pipeline.pushData(new object[] { arrayToGetData }, new object[] { arrayToReceiveResult }))
{
    Console.WriteLine("First result is ready!");
    // uses the result for other tasks when its ready (instead of garbage results for the first N iterations)
}

arrayToGetData: is given in an array of object that must be exact same length and type of the input array of first stage. If there are multiple input arrays in the first stage, then there has to be multiple arrays given as an array of objects.

arrayToReceiveResult: is a target for copying results from last stage's output arrays. Similar to input arrays, output arrays has to be same type and length with this receiver array.

To get the results sooner(while it doesn't change the throughput), data may not be pushed and popped, instead, input-output arrays of first(and last) stages can be directly accessed. Not using the pushData method's parameters makes the pipeline not flip the double buffers at the both ends of pipeline. This way, a result can be taken 2 iterations quicker.

Examples:

M is number of stages in the pipeline.

if(pipeline.pushData(new object[] { arrayToGetData }, new object[] { arrayToReceiveResult }))
{
    Console.WriteLine("Extra client arrays for both inputs and outputs of pipeline");
    Console.WriteLine("Needs M*2+1 iterations");  
    Console.WriteLine("arrayToGetData  and arrayToReceiveResult are used");  
}

if(pipeline.pushData(null, new object[] { arrayToReceiveResult }))
{
    Console.WriteLine("Extra client arrays for only outputs of pipeline");
    Console.WriteLine("Needs M*2 iterations"); 
    Console.WriteLine("input arrays of first stage are directly accessed");   
}

if(pipeline.pushData(new object[] { arrayToGetData }))
{
    Console.WriteLine("Extra client arrays for only inputs pipeline");
    Console.WriteLine("Needs M*2 iterations");  
    Console.WriteLine("output arrays of last stage are directly accessed");
}

if(pipeline.pushData())
{
    Console.WriteLine("Needs M*2-1 iterations"); 
    Console.WriteLine("input arrays of first stage and output arrays of last stage are directly accessed"); 
}

Note: only horizontal(and linear) pipelines are supported for now. Future versions will include multi-layered stages(that run concurrently to each other)


Revisiting rules:

  • Input arrays given as parameter in pushData method must match with the first stage input arrays by their types and number (and order).
  • Output arrays given as parameter in pushData method must match with the last stage output arrays by their types and number (and order).
  • If any pushData parameter is given null value, results for an input is computed 1 iteration sooner but the output or the input may need to be taken/put as a direct access to arrays of stages.
  • "numberOfArrayElementsPerWorkItem" field of each input-ouput array per stage is important for copying whole array to-from a device. A byte array with "12" as the field is a proper input for a float3 array (4 from float type, 3 from float3's width, multiplication gives 12 bytes per float3 element)
  • Inputs, hidden arrays(for sequential logic) and outputs must be added with the same order of kernel functions arguments.
  • Hidden arrays do not move data to other arrays. They also don't have duplicates and used for sequential logic per stage.
  • (v1.2.3+) Kernel names can be separated with " ",",",";" and new line character. They run in the order they are written and use the global-local range values from the int array parameters in same method call for the execution. When some of them are separated with "@" character, those group of kernels are run at once without extra sync points between them. So using "@" character decreases number of both array copies and abundant sync points to decrease total latency of kernels. When kernel name separation is done without "@", first kernel should use input array(and optionally hidden array), intermediate kernels should use hidden arrays, last kernel should use output array(also optionally hidden array) since name separators other than "@" reads same input arrays and writes same output arrays repeatedly because device to device pipeline is meant for single kernel per device. "@" separated kernel names use same global and local ranges so the number of global and local ranges must be 1 for all of them. Example:
mulFunctionStage.addKernels(@"
__kernel void init(__global double * input,__global double * sequentialLogic, __global double * output)
{
     int threadId=get_global_id(0);
     sequentialLogic[threadId]+=0.1;
}

__kernel void sqrtFunction(__global double * input,__global double * sequentialLogic, __global double * output)
{
     int threadId=get_global_id(0);
     output[threadId]=sqrt(input[threadId])+sequentialLogic[threadId];
}", "sqrtFunction@init@init@init@init", new int[] { N}, new int[] { 256});

here a sqrtFunction kernel and 4x init functions use same N value as global range and 256 as local range values. If different separations are mixed:

"sqrtFunction@init@init@init@init someMoreKernels", new int[] { N,M}, new int[] { 256,128}
"a@b c@d,e@e;f f", new int[]{K/*@*/,L/*@*/,M/*@*/,N,N},new int[]{64/*@*/,64/*@*/,256/*@*/,128,256}