Riallto uses a mix of compilers to build applications for NPU devices. For the C++ VLIW vector-processor code, it uses a compiler called Chess, and for configuring data movement, it uses MLIR and an MLIR dialect called MLIR-AIE.
This blog explores how to inspect the MLIR-AIE code generated when constructing an NPU application with Riallto, and explains some of the elements of that code. It aims to provide a slightly deeper dive into the Riallto compilation toolchain than the notebooks provide. For a higher-level overview, please refer to this notebook in the Riallto repository.
Riallto MLIR-AIE codegen
Let's say we have the following callgraph that defines a pipeline where a single passthrough kernel is passed each row of a 720p RGBA image:
In the callgraph method, each row of the input image, img_in, is passed into the passthrough kernel, and the output of passthrough is written into the corresponding row of the output image, img_out. The passthrough kernel itself is just a simple memcpy, copying the input data to the output.

The callgraph above provides a high-level description of our pipeline. However, this description needs to be mapped spatially and temporally onto the NPU array to construct an application. The spatial mapping determines which kernels communicate with each other and the connections between them; in this case, we have directly connected the interface tile to the passthrough kernel. The temporal mapping determines how data is partitioned and flows through the pipeline; here, we split the image up row by row and pass each row into the pipeline. To perform this mapping, Riallto uses a tracer: a behavioural execution of the pipeline is performed and traced to determine which kernels are communicating and the sequence in which data flows through the pipeline.
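To make the row-by-row behaviour concrete, here is a minimal pure-numpy model of the pipeline. It requires no Riallto or NPU hardware, and the function and array names are illustrative, not part of the Riallto API:

```python
import numpy as np

def passthrough(row: np.ndarray) -> np.ndarray:
    """Behavioural model of the kernel: a simple memcpy of one row."""
    return row.copy()

# A 720p RGBA image: 720 rows, each row 1280 pixels * 4 channels.
rng = np.random.default_rng(0)
img_in = rng.integers(0, 256, size=(720, 1280 * 4), dtype=np.uint8)
img_out = np.zeros_like(img_in)

# The callgraph iterates over the rows, passing each through the kernel.
for row_idx in range(img_in.shape[0]):
    img_out[row_idx] = passthrough(img_in[row_idx])

assert np.array_equal(img_in, img_out)
```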
Our AppBuilder class is callable, and when we provide it with some initial numpy arrays for inputs and outputs, it calls the application tracer. We can execute the tracer with the following:
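Conceptually, the tracer records which kernels are invoked, and with what data, during a behavioural run of the callgraph. A toy sketch of that idea follows; it is entirely hypothetical and is not the Riallto implementation:

```python
from typing import Callable

class TracingKernel:
    """Wraps a behavioural kernel and records every invocation."""
    def __init__(self, name: str, fn: Callable, trace: list):
        self.name, self.fn, self.trace = name, fn, trace

    def __call__(self, data):
        result = self.fn(data)
        # Record the kernel name and the transfer size for this call.
        self.trace.append((self.name, len(data)))
        return result

trace: list = []
passthrough = TracingKernel("passthrough", lambda row: list(row), trace)

rows = [[1, 2, 3, 4], [5, 6, 7, 8]]       # two tiny "image rows"
out = [passthrough(row) for row in rows]  # behavioural execution

# The trace now captures the communication pattern and sequence.
print(trace)  # → [('passthrough', 4), ('passthrough', 4)]
```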
Tracing the application produces metadata detailing the communication patterns of the kernels and how data moves through them. It's possible to view this metadata in a Jupyter Notebook with the following:
app.metadata
Riallto uses the metadata to codegen MLIR-AIE. The MLIR-AIE code describes more concretely the mapping of the application pipeline onto the NPU array and how data flows through the pipeline.
To view the MLIR-AIE code generated from the above example, run the following command in a Jupyter Notebook after the tracer has completed:
app_builder.displaymlir()
The above command should show the MLIR-AIE code below. Remember, this is machine-generated, low-level code; it might look overwhelming. However, in the following section, we will discuss portions to explain what is happening and give you a feel for how this describes the application. For a more detailed description of the MLIR-AIE operations in use here, you should examine the MLIR-AIE documentation.
Highlighting some aspects of the generated MLIR-AIE code
We will now highlight some aspects of the MLIR code that Riallto generated for the above example.
At the top of the generated MLIR-AIE code, we can see the following:
%tile00 = AIE.tile(0, 0)
%tile02 = AIE.tile(0, 2)
These declare the tiles our application uses, identified by (column, row) coordinates within the NPU array. Our application uses the interface tile %tile00 at location (0, 0) and a single compute tile %tile02 at location (0, 2) for our passthrough kernel.
Communication between the tiles and kernels is handled via objectFifos (documentation), for example:
The above specifies that we want to connect the interface tile %tile00 to %tile02, the tile where our passthrough kernel executes. The type and size of the transfers along this connection are specified with the memref type.
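An objectFifo behaves like a small pool of buffers shared between a producer and a consumer with acquire/release semantics. A simplified, single-threaded Python model of a depth-2 (double-buffered) FIFO is sketched below; the class and method names are illustrative, not the MLIR-AIE API:

```python
from collections import deque

class ObjectFifo:
    """Toy model of a depth-N objectFifo: the producer acquires empty
    buffers and the consumer acquires full ones; release returns them."""
    def __init__(self, depth: int, elem_size: int):
        self.empty = deque(bytearray(elem_size) for _ in range(depth))
        self.full = deque()

    def acquire_produce(self):
        return self.empty.popleft()   # on real hardware, gated by locks

    def release_produce(self, buf):
        self.full.append(buf)

    def acquire_consume(self):
        return self.full.popleft()

    def release_consume(self, buf):
        self.empty.append(buf)

fifo = ObjectFifo(depth=2, elem_size=4)

buf = fifo.acquire_produce()
buf[:] = b"\x01\x02\x03\x04"          # producer fills one element
fifo.release_produce(buf)

out = fifo.acquire_consume()          # consumer sees the same data
assert bytes(out) == b"\x01\x02\x03\x04"
fifo.release_consume(out)             # buffer returns to the empty pool
```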
Our passthrough kernel is not compiled with MLIR but with the Chess VLIW vector compiler. However, we use MLIR-AIE to generate wrapper code around the kernel to control data moving into and out of it. We call our passthrough kernel within the MLIR code and link it with the object file the Chess compiler generates. The function prototype for our passthrough kernel is specified as follows:
The wrapper code and the call to our passthrough kernel executing on the compute tile are specified within an AIE.core block. In our simple pipeline we only have a single kernel, so we only have a single AIE.core block; at the end of the block, we specify that the compiled code for this tile should be linked with the object file that Chess generated for our passthrough kernel. Inside the AIE.core block, the passthrough kernel is called in a loop. On each iteration of the loop, we access the next elements of the input and output objectFifos, %elem1 and %elem2. We can also see how the objectFifos use the hardware locking architectural features to ensure that data is safely consumed and produced.
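Putting these pieces together, the per-tile loop follows a simple acquire-compute-release pattern. A standalone Python model of that loop structure (the FIFOs here are plain queues, and all names are illustrative):

```python
from collections import deque

def passthrough(src: bytes) -> bytes:
    """Behavioural stand-in for the Chess-compiled kernel: a memcpy."""
    return bytes(src)

# The input and output objectFifos, modelled as queues of row buffers.
fifo_in = deque([b"row0", b"row1", b"row2"])
fifo_out = deque()

# Model of the AIE.core loop: acquire, compute, release, repeat.
for _ in range(3):
    elem_in = fifo_in.popleft()       # acquire the input element
    elem_out = passthrough(elem_in)   # call the kernel on it
    fifo_out.append(elem_out)         # release the output element

assert list(fifo_out) == [b"row0", b"row1", b"row2"]
```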
Finally, the @sequence function at the end of the MLIR-AIE code is a special function that specifies how the interface tiles communicate with host memory.
We can see two buffer references, %itbuffer_0 and %itbuffer_1, in the sequence function arguments. These references are our input and output images on the host side.
Within the body of the sequence function, we can see two calls to an AIE.ipu.dma_memcpy_nd function; these are multi-dimensional memcpy commands that read and write the host buffers and either push or pull data through the objectFifos connected to the compute or memory tiles. With this command, it is possible to iterate over the input and output buffers in up to 4 dimensions. Please refer to the MLIR-AIE documentation for more information on this operation.
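The multi-dimensional access pattern can be understood as nested loops over (size, stride) pairs. Below is a minimal Python model of a rank-4 strided read from a flat host buffer; the function and parameter names are illustrative, not the MLIR-AIE operation's signature:

```python
import itertools

def dma_read_nd(buffer, sizes, strides, offset=0):
    """Model of a 4-D DMA access pattern: for every index tuple
    (i3, i2, i1, i0), read buffer[offset + sum(i * stride)]."""
    out = []
    for idx in itertools.product(*(range(s) for s in sizes)):
        addr = offset + sum(i * st for i, st in zip(idx, strides))
        out.append(buffer[addr])
    return out

# Read a 4x4 "image" row by row from a flat 16-element buffer:
flat = list(range(16))
rows = dma_read_nd(flat, sizes=(1, 1, 4, 4), strides=(0, 0, 4, 1))
assert rows == flat

# Swapping the two inner strides walks the same buffer column by column:
cols = dma_read_nd(flat, sizes=(1, 1, 4, 4), strides=(0, 0, 1, 4))
assert cols == [0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15]
```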
The final command in the sequence is a sync operation; this tells the system to wait until it has seen the last data transfer on a particular DMA channel and ensures that the system is synchronised.
Summary
In this blog, we have dived a bit deeper into the codegen side of Riallto, exploring how it takes a high-level description of the system, traces it, and uses the tracing information to generate MLIR-AIE. We then explored some of the generated MLIR-AIE code to understand how it describes the final placed pipeline on the NPU array. For more information on MLIR-AIE, please refer to the MLIR-AIE documentation and MLIR-AIE tutorials.