
Mid-level intermediate representation for tensor computations #50

Draft: wants to merge 4 commits into master
Conversation

@uphoffc uphoffc commented Jun 13, 2023

I think we can all agree that the codegen part of YATeTo is in pretty bad shape. My list of open problems is the following:

  • The design is heavily CPU-oriented. The heavy lifting is done by loops over GEMM calls. As calls to a single small GEMM would be very slow on the GPU, implicit batching was added to codegen. However, that is still insufficient, as one eventually wants fused kernels, where multiple small GEMMs are fused into a single GPU kernel. The fused-gemm approach goes in the right direction but currently works only for chainforge and only supports matrix multiplication.
  • There is no specification of what the individual components like gemm, indexsum, log, etc. have to do. Everything has to be inferred from the code.
  • A JIT mode is currently not included. The only possibility is to include the kernel as a static variable in a kernel function (as for the LIBXSMM JIT generator).
  • There is currently too much stuff packed into codegen. We have the kernel code, the tensor descriptor code, the initialization code, the unit test code, and so on; it is hard to understand and to maintain.

To address some of these problems, I propose to create a mid-level tensor IR for codegen that is inspired by the LLVM IR. The idea is that YATeTo maps the output from controlflow to the tensor IR. (Or controlflow already operates on the tensor IR, which would be even better.) The tensor IR can then be picked up by a back-end-specific tool that generates the low-level code for the target architecture. Here, the CPU back-end could be similar to the current design, where generic C code is mixed with calls to BLAS libraries, whereas the GPU back-end(s) likely need to generate both the glue code and the BLAS code.

The current draft added in this PR is by no means final; I just extracted a language definition from my current state, where many things are still missing.

Some general points I'd like to put for discussion:

  • Do you think that a tensor IR is a generally good solution to tackle some of the open problems?
  • Who would be interested in adopting the tensor IR for the back-end implementation? Maybe @ravil-mobile for chainforge?
  • Should we still include the CSC memory layout for sparse matrix multiplication? There are currently many undocumented restrictions for the CSC format because true support for sparse is difficult. Is sparse actually used for new developments such as ARM, @krenzland? (A short sketch of the CSC layout follows below.)
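For reference, the following is a minimal numpy sketch (illustrative only, not taken from yateto) of the CSC (compressed sparse column) layout in question: the three-array storage and a matrix-vector product over it.

import numpy as np

# Illustrative sketch (not from yateto): a matrix stored column by column
# in three arrays (values, row indices, column pointers).
A = np.array([[4.0, 0.0, 0.0],
              [0.0, 5.0, 2.0],
              [1.0, 0.0, 3.0]])

values, rowidx, colptr = [], [], [0]
for j in range(A.shape[1]):          # walk the columns
    for i in range(A.shape[0]):
        if A[i, j] != 0.0:
            values.append(A[i, j])   # non-zero entry
            rowidx.append(i)         # its row index
    colptr.append(len(values))       # one-past-the-end of column j

def csc_matvec(nrows, values, rowidx, colptr, x):
    # y = A @ x for a matrix stored in CSC format
    y = np.zeros(nrows)
    for j in range(len(colptr) - 1):
        for p in range(colptr[j], colptr[j + 1]):
            y[rowidx[p]] += values[p] * x[j]
    return y

x = np.array([1.0, 2.0, 3.0])
assert np.allclose(csc_matvec(A.shape[0], values, rowidx, colptr, x), A @ x)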

uphoffc added 2 commits June 13, 2023 12:13
@krenzland

Do you think that a tensor IR is a generally good solution to tackle some of the open problems?

Yes, of course!

Should we still include the CSC memory layout for sparse matrix multiplication? There are currently many undocumented restrictions for the CSC format because true support for sparse is difficult. Is sparse actually used for new developments such as ARM, @krenzland?

Yes, we spent a considerable amount of work getting Pspamm to work on A64FX. It's probably smart to include it, as it makes sense for some kernels. It would also make yateto a bit more general.

@davschneller davschneller left a comment

Hi, great idea! So the IR itself is supposed to work mostly with matrices? Or also higher-dimensional tensors (cf. the comment in memref)?

Some remarks which came to my mind when using YATeTo, or rather, feature ideas (if we are done with implementing the existing spec):

  • numpy-like features (most of these ideas mostly apply to higher-level tensors, but all of them could be done for matrices only as well):
    • matrix broadcasting, i.e. introduce a new dimension or stack copies of a matrix (if that's quicker than multiplying with a vector of just ones)
    • trace computation, i.e. something like A['ii']? Maybe also partial traces, i.e. in tensor notation something like A['iij'].
    • matrix reshaping (in the end, that should amount to a zero-cost instruction anyways)
    • transposition (or rather, dimension swapping ... Maybe sometimes more useful than looping over a bad GEMM config?)
  • point-wise nonlinear function evaluation (with that, we could re-formulate most likely a lot of dynamic rupture and/or plasticity using YATeTo)

.. code:: abnf

submatrix-instruction = "submatrix" local-identifier "[" slice "," slice "]"
slice = integer-constant ":" integer-constant

What about adding a stride here? (Or is that something for a higher-level IR instead?)
That would probably be relevant if memref will only support 2D tensors (i.e. matrices); otherwise one could just reshape the tensor and take what's useful.
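As an aside, a strided submatrix is still just a view described by an offset plus per-dimension strides, so a stride field in the slice rule would fit the memref model; a small numpy sketch (illustrative only, not part of the spec):

import numpy as np

# A strided submatrix of a column-major matrix is a view, not a copy:
# it is fully described by an offset and per-dimension (byte) strides.
A = np.asfortranarray(np.arange(64, dtype=np.float32).reshape(8, 8))
sub = A[1:7:2, 2:6]               # rows 1, 3, 5 and columns 2..5

print(np.shares_memory(sub, A))   # True: a view into A
print(sub.strides)                # the row stride of 2 shows up in the byte strides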

uphoffc commented Jun 16, 2023

Hi, great idea! So the IR itself is supposed to work mostly with matrices? Or also higher-dimensional tensors (cf. the comment in memref)?

Yes, higher-dimensional tensors should be supported as well. The spec also needs to be extended to include loop-over-GEMM, indexsum, and product nodes. The question here is which level of abstraction one should choose. E.g. for loop-over-GEMM, one could either introduce a dedicated instruction or some kind of looping construct over matmul.

  • numpy-like features (most of these ideas mostly apply to higher-level tensors, but all of them could be done for matrices only as well):

Which numpy features?

  • matrix broadcasting, i.e. introduce a new dimension or stack copies of a matrix (if that's quicker than multiplying with a vector of just ones)

One could add a generic copy instruction that could also serve as a broadcast. In the high-level language of YATeTo I'd rather not introduce broadcasting (see paper) but one could pattern match a multiplication with a vector of ones to a copy.
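As a quick illustration of that pattern-matching idea (numpy sketch, illustrative only): multiplying with a vector of ones and broadcasting give the same result, so the former can be rewritten as a copy/broadcast.

import numpy as np

# Multiplying with a vector of ones is the same as broadcasting a new mode.
A = np.random.rand(16, 8)
ones = np.ones(4)

via_product   = np.einsum('ij,k->ijk', A, ones)             # multiply with ones
via_broadcast = np.broadcast_to(A[:, :, None], (16, 8, 4))  # plain broadcast/copy

assert np.allclose(via_product, via_broadcast)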

  • trace computation, i.e. something like A['ii']? Maybe also partial traces, i.e. in tensor notation something like A['iij'].

Uhuh, that's going to be ugly. So A[iji], for instance, could mean two things. Either it means B_j = sum_i A[iji], i.e. we have to introduce an IndexSum node, or it means B_ij = A[iji] should be viewed as a matrix. In the latter case, the entry ij is found at A + i + j * ld0 + i * ld0 * ld1 = A + i * (1 + ld0 * ld1) + j * ld0, that is, a matrix with non-unit stride.
So it might be possible, but I think rather at the level of the high-level language and not on the level of the IR.
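A small numpy sketch (illustrative only) of the two readings described above:

import numpy as np

n, m = 4, 3
A = np.random.rand(n, m, n)          # tensor with mode sizes (n, m, n)

# Reading 1: partial trace, B_j = sum_i A[i, j, i] (an IndexSum node)
B_trace = np.einsum('iji->j', A)

# Reading 2: diagonal view, B_ij = A[i, j, i]; in column-major storage the
# entry (i, j) sits at offset i * (1 + ld0 * ld1) + j * ld0, i.e. a matrix
# with non-unit stride.
B_view = np.einsum('iji->ij', A)

assert np.allclose(B_trace, B_view.sum(axis=0))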

  • matrix reshaping (in the end, that should amount to a zero-cost instruction anyways)

Yes, here one would just need an instruction that maps one type to another type.

  • transposition (or rather, dimension swapping ... Maybe sometimes more useful than looping over a bad GEMM config?)

Yes, transposition would be good to have.

  • point-wise nonlinear function evaluation (with that, we could re-formulate most likely a lot of dynamic rupture and/or plasticity using YATeTo)

The question is how one would describe the non-linear function. One option is via a language extension, but that is going to be limited. The other option would be to allow a function call to a user-provided function. But the latter is tricky, as we want to support different run-times (i.e. CPU vs. GPU).

@ravil-mobile

Hi all,

@uphoffc, maybe we shouldn't restrict ourselves to only one IR. I think that several IRs would be beneficial for different compilation stages.

One more thing that came to my mind is the use of attributes. For example, information like shape, stride indices, etc. can be part of the tensor attributes.

For-loops, which are used in LoG, could be modeled somewhat similarly to the affine dialect in MLIR.

ravil-mobile commented Jul 14, 2023

@uphoffc, I opened a discussion with the xDSL developers: xdslproject/xdsl#1277. I think it may be interesting for you to have a look.

In short, xDSL does not need MLIR by default.

@davschneller

Hi, great idea! So the IR itself is supposed to work mostly with matrices? Or also higher-dimensional tensors (cf. the comment in memref)?

Yes, higher-dimensional tensors should be supported as well. The spec also needs to be extended to include loop-over-GEMM, indexsum, and product nodes. The question here is which level of abstraction one should choose. E.g. for loop-over-GEMM, one could either introduce a dedicated instruction or some kind of looping construct over matmul.

In hindsight, I may have misunderstood the purpose of this IR (sorry about that), and therefore I'll have to ask again from the beginning:

For what do we intend to use this IR?

I.e. on what level? Do we want the IR to be 1:1 translatable to a code file, do we rather want it to be architecture-independent, or maybe just to apply high-level tensor operations? I'd agree with Ravil that we may need several IRs in the end.
Whether what I wrote below made/makes sense depends on that question.

  • numpy-like features (most of these ideas mostly apply to higher-level tensors, but all of them could be done for matrices only as well):

Which numpy features?

The tensor-manipulation features I meant below; but again, they could be a bit too high-level.

[...]

  • point-wise nonlinear function evaluation (with that, we could re-formulate most likely a lot of dynamic rupture and/or plasticity using YATeTo)

The question is how one would describe the non-linear function. One option is via a language extension, but that is going to be limited. The other option would be to allow a function call to a user-provided function. But the latter is tricky, as we want to support different run-times (i.e. CPU vs. GPU).

I meant mostly functions that are available in e.g. the cmath header (plus division), i.e. available in SSE/AVX{-,2,10,512} or via such extensions as SVML. And most likely also readily existent on a GPU. E.g. sin, cos, sqrt, max, min, division, and so on and so forth. I guess these kind of operations would need to be available on all kinds of IR, if we want to support them.

One further point would also be maybe float-to-double conversion (and vice-versa)?

uphoffc commented Aug 29, 2023

For what do we intend to use this IR?

I.e. on what level? Do we want the IR to be 1:1 translatable to a code file, do we rather want it to be architecture-independent, or maybe just to apply high-level tensor operations? I'd agree with Ravil that we may need several IRs in the end. Whether what I wrote below made/makes sense depends on that question.

The IR is intended to be mid-level. In the IR, there are only binary operations (like gemm, axpy, ...) and one needs to manage the temporary storage. We need to introduce the IR because we cannot easily translate C-code-that-calls-GEMM on GPUs.
The IR is architecture-independent and it is also serial. It is up to the compiler to implement vectorization and parallelization.

I meant mostly functions that are available in e.g. the cmath header (plus division), i.e. available in SSE/AVX{-,2,10,512} or via such extensions as SVML. And most likely also readily existent on a GPU. E.g. sin, cos, sqrt, max, min, division, and so on and so forth. I guess these kind of operations would need to be available on all kinds of IR, if we want to support them.

I would not support those as long as no one needs them. If there's an application in SeisSol / tandem or someone requests such features then we can add them but for now we should keep it simple and stupid.

One further point would also be maybe float-to-double conversion (and vice-versa)?

Same as above, if anyone wants to work on mixed-precision in SeisSol then we can think about it, otherwise no.

@davschneller

For what do we intend to use this IR?
I.e. on what level? Do we want the IR to be 1:1 translatable to a code file, do we rather want it to be architecture-independent, or maybe just to apply high-level tensor operations? I'd agree with Ravil that we may need several IRs in the end. Whether what I wrote below made/makes sense depends on that question.

The IR is intended to be mid-level. In the IR, there are only binary operations (like gemm, axpy, ...) and one needs to manage the temporary storage. We need to introduce the IR because we cannot easily translate C-code-that-calls-GEMM on GPUs. The IR is architecture-independent and it is also serial. It is up to the compiler to implement vectorization and parallelization.

I'm sorry, maybe I'm too new to everything... But there are still more questions: how are the operations meant to be executed? I.e. strictly one after another, or may there also be some sort of tiling, or keeping data in registers (e.g. when we want to compute a Taylor series as done in SeisSol, i.e. a series of axpys onto the same target, then the target data could maybe be kept in registers...)? How would we handle the temporary memory there?

In particular (just to see if I got it now finally), for tiling, if ever necessary: would we just define sub-regions for tensors and go over them (resulting in less temporary memory than running over a whole matrix at once)?

I meant mostly functions that are available in e.g. the cmath header (plus division), i.e. available in SSE/AVX{-,2,10,512} or via such extensions as SVML. And most likely also readily existent on a GPU. E.g. sin, cos, sqrt, max, min, division, and so on and so forth. I guess these kind of operations would need to be available on all kinds of IR, if we want to support them.

I would not support those as long as no one needs them. If there's an application in SeisSol / tandem or someone requests such features then we can add them but for now we should keep it simple and stupid.

They would be useful to fuse e.g. plasticity, dynamic rupture, or boundary computations; at least, that was what I was thinking here. I.e. merge the matrix multiplication (modal->nodal) with the scalar operations happening afterwards (operations per node), then transform back (nodal->modal). Then we could mostly avoid having explicit SYCL kernels in the code.
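A rough numpy sketch of that fusion idea (illustrative only, with made-up sizes): transform to the nodal basis, apply a point-wise function per node, and transform back.

import numpy as np

modes, nodes = 56, 64                  # hypothetical sizes
V    = np.random.rand(nodes, modes)    # modal -> nodal
Vinv = np.linalg.pinv(V)               # nodal -> modal (illustration only)
Q    = np.random.rand(modes)           # modal coefficients

nodal = V @ Q                          # matrix multiplication (modal -> nodal)
nodal = np.maximum(nodal, 0.0)         # point-wise operation per node
Q_new = Vinv @ nodal                   # transform back (nodal -> modal)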

One further point would also be maybe float-to-double conversion (and vice-versa)?

Same as above, if anyone wants to work on mixed-precision in SeisSol then we can think about it, otherwise no.

That's why I was asking: SeisSol/SeisSol#932. It's not an immediate requirement, but something to maybe keep in mind longer-term.

uphoffc commented Aug 29, 2023

I'm sorry, maybe I'm too new to everything... But there are still more questions: how are the operations meant to be executed? I.e. strictly one after another, or may there also be some sort of tiling, or keeping data in registers (e.g. when we want to compute a Taylor series as done in SeisSol, i.e. a series of axpys onto the same target, then the target data could maybe be kept in registers...)? How would we handle the temporary memory there?

The operations are meant to be executed in the order that they are written. There is nothing special here. For temporary memory there is the alloca instruction, which needs to be generated by yateto. Tiling and vectorization choices are left to the IR compiler.

In particular (just to see if I got it now finally), for tiling, if ever necessary: would we just define sub-regions for tensors and go over them (resulting in less temporary memory than running over a whole matrix at once)?

I guess you're thinking more about large tensor contractions here. The tensors we have in SeisSol should be small enough to fit in shared local memory / L2 cache.
One could also consider implementing tiling algorithms for large tensor contractions in the IR, but that would likely require tensors to have dynamic mode sizes. That is something I want to avoid for now, as it makes writing the compiler more difficult but adds little value for SeisSol.

They would be useful to fuse e.g. plasticity, dynamic rupture, or boundary computations; at least, that was what I was thinking here. I.e. merge the matrix multiplication (modal->nodal) with the scalar operations happening afterwards (operations per node), then transform back (nodal->modal). Then we could mostly avoid having explicit SYCL kernels in the code.

Yeah that's possible but one should keep in mind whether it is more work to have everything in yateto or to just write a specialized CPU / SYCL kernel for Plasticity.

That's why I was asking: SeisSol/SeisSol#932 . It's not an immediate requirement, but just to maybe keep in mind longer-term.

I think the largest difficulty is whether you have an explicit copy for floating point type conversion or whether you can fuse that with GEMM / axpy / etc. Currently, I don't think that any of the GEMM back-ends supports mixed precision.

@krenzland

The operations are meant to be executed in the order that they are written. There is nothing special here. For temporary memory there is the alloca instruction, which needs to be generated by yateto. Tiling and vectorization choices are left to the IR compiler.

But the compiler is surely allowed to reorder the operations at will?

Yeah that's possible but one should keep in mind whether it is more work to have everything in yateto or to just write a specialized CPU / SYCL kernel for Plasticity.

I would also vote for for-loops with constant iteration count and simple non-linear functions. This would really simplify many kernels and would actually be quite useful.

uphoffc commented Aug 29, 2023

But the compiler is surely allowed to reorder the operations at will?

Yes, of course, as long as it is legal to do so. The idea is that if, e.g., a "matmul" instruction is encountered then the compiler is going to parallelize that however it wants to, e.g. using a combination of vector instructions and possibly threads.

I would also vote for for loops with constant iteration count and simple non-linear functions. This would really simplify many kernels and would actually be quite useful.

For non-linear functions I'd use an "apply" or an "apply_along_axis" instruction. That's easy to write and easy to vectorize. Loops might be allowed as well to implement loop-over-GEMM (instead of a LoG instruction), but I am not completely sure about that point yet.
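For illustration (numpy happens to have functions of these names; they are used here only as an analogy for the proposed instructions):

import numpy as np

A = np.random.rand(16, 8)

B = np.sqrt(A)                                       # "apply": point-wise function on every entry
norms = np.apply_along_axis(np.linalg.norm, 0, A)    # "apply_along_axis": one value per column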

uphoffc commented Sep 15, 2023

I pushed a larger update to the spec.

The largest open question addressed in this update is, to me, whether one should have dedicated instructions for loop-over-GEMM, index sum, and product, or loops over BLAS instructions. I went for dedicated instructions for the following reasons:

  • The mapping to standard BLAS is clear for loop-over-GEMM but not so clear for the other operations. I thought about things like loop-over-GER but quickly realized that the "product" operation is more complex and versatile than a loop-over-GER. Essentially, for product, sum, and copyscaleadd, I am not aware of a "gold-standard" solution for mapping them to BLAS or other primitives.
  • Dedicated instructions can easily be lowered to more primitive instructions, but the other way around is more difficult. That is, it is easier to identify a special loop-over-GEMM instruction from a high-level description than to reconstruct the loop-over-GEMM from a loop + GEMM description.

Therefore, I have introduced the log (loop-over-GEMM), product, sum, and axpby instructions. They all share the common feature that an index map has to be provided as an attribute to the instruction. E.g. the product C[ijk] <= A[ij] B[k] would be annotated with

{ij, k to ijk}

in the tensor IR. For loop-over-GEMM, the index map must be annotated to indicate looped and fused modes, e.g.

{i[k]m, m(jn) to i(jn)[k]}

That way, we can select loop-over-GEMM in yateto. One question here is whether such annotations are mandatory for the compiler or whether they are only hints. In any case, I think they must always be present, such that the compiler does not need to implement the contraction-to-loop-over-GEMM mapping logic.
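To connect the two index maps above to more familiar notation, here is a numpy sketch (illustrative only, with arbitrary mode sizes); the loop-over-GEMM part shows how the looped mode [k] and the fused modes (jn) map to one plain GEMM per loop iteration.

import numpy as np

# product {ij, k to ijk}: C[ijk] <= A[ij] B[k]
A = np.random.rand(16, 8)
B = np.random.rand(4)
C = np.einsum('ij,k->ijk', A, B)

# loop-over-GEMM {i[k]m, m(jn) to i(jn)[k]}: contract over m, loop over k,
# fuse (j, n) into a single column mode of the GEMM.
i, k, m, j, n = 16, 5, 8, 4, 3
A2 = np.random.rand(i, k, m)
B2 = np.random.rand(m, j, n)
ref = np.einsum('ikm,mjn->ijnk', A2, B2)

C2 = np.empty((i, j, n, k))
for kk in range(k):                              # loop over the [k] mode
    gemm = A2[:, kk, :] @ B2.reshape(m, j * n)   # GEMM with fused (jn) columns
    C2[..., kk] = gemm.reshape(i, j, n)
assert np.allclose(C2, ref)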

Further large changes are the introduction of arbitrarily high-order tensors, the subview instruction (replacing the submatrix instruction), and the introduction of hexadecimal floating point constants (as in C). Moreover, types need to be added to every instruction such that the compiler does not need any type inference logic.

Short example:

define @fused_kernel(%A: group<memref<f32x16x8>,pointers>,
                     %B: memref<f32x8x8>,
                     %C: memref<f32x8x16>,
                     %D: group<memref<f32x16x16>,distance<256>>) {
  %0 = get_work_item %A : group<memref<f32x16x8>,pointers>
  %1 = get_work_item %D : group<memref<f32x16x16>,distance<256>>
  %tmp0 = alloca : memref<f32x16x8>
  log {ik, kj to ij} 1.0, %0, %B, 0.0, %tmp0 : memref<f32x16x8>, memref<f32x8x8> to memref<f32x16x8>
  log {ik, kj to ij} 5.0, %tmp0, %C, 1.0, %1 : memref<f32x16x8>, memref<f32x8x16> to memref<f32x16x16>
}
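For readers less familiar with the notation, one possible reading of the example in numpy terms (assuming that get_work_item selects the batch entry and that each log here is a plain GEMM; the batch size is made up):

import numpy as np

batch = 32                                    # hypothetical batch size
A = [np.random.rand(16, 8).astype(np.float32) for _ in range(batch)]
B = np.random.rand(8, 8).astype(np.float32)
C = np.random.rand(8, 16).astype(np.float32)
D = np.random.rand(batch, 16, 16).astype(np.float32)

b = 7                                         # the work item
tmp0 = 1.0 * A[b] @ B                         # first log:  tmp0 = 1.0 * A_b * B + 0.0 * tmp0
D[b] = 5.0 * tmp0 @ C + 1.0 * D[b]            # second log: D_b  = 5.0 * tmp0 * C + 1.0 * D_b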

Looking forward to your input / comments!
