Mid-level intermediate representation for tensor computations #50
Signed-off-by: Carsten Uphoff <[email protected]>
Yes, of course!
Yes, we spent considerable effort getting Pspamm to work on A64FX. It's probably smart to include it, as it makes sense for some kernels. It would also make yateto a bit more general.
Hi, great idea! So the IR itself is supposed to work mostly with matrices? Or also higher-dimensional tensors (cf. the comment in memref)?
Some remarks which came to my mind when using YATeTo, or rather, feature ideas (for once we are done with implementing the existing spec; a small numpy sketch of these operations follows the list):
- numpy-like features (most of these ideas apply mainly to higher-dimensional tensors, but all of them could be done for matrices only as well):
  - matrix broadcasting, i.e. introduce a new dimension or stack copies of a matrix (if that's quicker than multiplying with a vector of just ones)
  - trace computation, i.e. something like A['ii']? Maybe also partial traces, i.e. in tensor notation something like A['iij'].
  - matrix reshaping (in the end, that should amount to a zero-cost instruction anyway)
  - transposition (or rather, dimension swapping ... maybe sometimes more useful than looping over a bad GEMM config?)
  - point-wise nonlinear function evaluation (with that, we could most likely re-formulate a lot of dynamic rupture and/or plasticity using YATeTo)
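To make these ideas concrete, here is a minimal numpy sketch of the operations meant above (plain numpy rather than YATeTo syntax; shapes are chosen only for illustration):

```python
import numpy as np

A = np.random.rand(4, 4)        # a matrix
T = np.random.rand(4, 4, 3)     # a rank-3 tensor

# broadcasting: introduce a new dimension / stack copies of a matrix
A_stacked = np.broadcast_to(A[:, :, None], (4, 4, 3))

# trace and partial trace, i.e. A['ii'] and A['iij'] in index notation
tr = np.einsum('ii', A)         # scalar trace
ptr = np.einsum('iij->j', T)    # partial trace over the repeated index

# reshaping: only the view changes, the data does not move
A_flat = A.reshape(16)

# transposition / dimension swapping
T_swapped = np.transpose(T, (2, 0, 1))

# point-wise nonlinear function evaluation
A_nl = np.tanh(A)
```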
docs/tensor-ir.rst
.. code:: abnf

   submatrix-instruction = "submatrix" local-identifier "[" slice "," slice "]"
   slice = integer-constant ":" integer-constant
What about adding a stride here? (or is that something for a higher-level IR instead?)
Probably, that would be something for the case that memref only supports 2D tensors (i.e. matrices). Otherwise one could just reshape the tensor instead and take what's useful.
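A minimal numpy sketch of the reshape-instead-of-stride argument (row-major layout assumed here; whether this carries over to the memref layout in the spec is an open question):

```python
import numpy as np

n, m = 8, 5
A = np.random.rand(n, m)

# a strided submatrix (every second row) ...
sub_strided = A[0::2, :]

# ... can also be obtained by reshaping and then slicing, so the
# submatrix/subview instruction would not need a stride parameter
sub_reshaped = A.reshape(n // 2, 2, m)[:, 0, :]

assert np.array_equal(sub_strided, sub_reshaped)
```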
Yes, higher-dimensional tensors should be supported as well. The spec also needs to be extended to include loop-over-GEMM, indexsum, and product nodes. The question here is which level of abstraction one should choose. E.g. for loop-over-GEMM, one could either introduce a dedicated instruction or some kind of looping construct over matmul.
Which numpy features?
One could add a generic copy instruction that could also serve as a broadcast. In the high-level language of YATeTo I'd rather not introduce broadcasting (see paper) but one could pattern match a multiplication with a vector of ones to a copy.
Uhuh, that's going to be ugly. So A[iji], for instance, could mean two things. Either it means B_j = sum_i A[iji], i.e. we have to introduce an IndexSum node, or it means that B_ij = A[iji] should be viewed as a matrix. In the latter case, the entry ij is found at A + i + j * ld0 + i * ld0 * ld1 = A + i * (1 + ld0 * ld1) + j * ld0, i.e. a matrix with non-unit stride.
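The two readings, spelled out in numpy (shapes chosen for illustration only):

```python
import numpy as np

ld0, ld1 = 4, 3
A = np.random.rand(ld0, ld1, ld0)    # the repeated index runs over axes 0 and 2

# reading 1: index sum, B_j = sum_i A[i, j, i]
B_sum = np.einsum('iji->j', A)

# reading 2: the diagonal viewed as a matrix, B_ij = A[i, j, i]
B_mat = np.einsum('iji->ij', A)

# In column-major storage, entry (i, j) of reading 2 sits at offset
# i * (1 + ld0 * ld1) + j * ld0, i.e. the matrix has a non-unit stride.
```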
Yes, here one would just need an instruction that maps one type to another type.
Yes, transposition would be good to have.
The question is how one would describe the non-linear function. One option is via a language extension, but that is going to be limited. The other option would be to allow a function call to a user-provided function. But the latter is tricky as we want to support different run-times (i.e. CPU vs GPU).
Hi all, @uphoffc, maybe we shouldn't be restricted to only one IR. I think that several IRs would be beneficial for different compilation stages. One more thing that came to my mind is the use of attributes. For example, information like shape, stride indices, etc. can be part of a tensor attribute. For-loops, which are used in LoG, can be modeled somewhat similarly to the affine dialect in MLIR.
@uphoffc , I opened a discussion with the xDSL developers: xdslproject/xdsl#1277. I think it may be interesting for you to have a look at. In short, xDSL does not need MLIR by default.
In hindsight, I may have misunderstood the purpose of this IR (sorry about that), and therefore I'll have to ask again from the beginning: what do we intend to use this IR for? I.e. on what level? Do we want the IR to be 1:1 translatable to a code file, do we rather want it to be architecture-independent, or maybe just to apply high-level tensor operations? I'd agree with Ravil that we may need several IRs in the end.
The tensor-manipulation features are the ones I meant below; but again, they could be a bit too high-level.
I meant mostly functions that are available in e.g. the … One further point would maybe also be float-to-double conversion (and vice versa)?
The IR is intended to be mid-level. In the IR, there are only binary operations (like gemm, axpy, ...) and one needs to manage the temporary storage. We need to introduce the IR because we cannot easily translate C-code-that-calls-GEMM on GPUs.
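A rough numpy sketch of what "only binary operations plus explicitly managed temporary storage" means for a chained product (the decomposition is illustrative, not IR syntax):

```python
import numpy as np

# kernel D = A * B * C, decomposed into two gemms with an explicit temporary
A = np.random.rand(8, 16)
B = np.random.rand(16, 32)
C = np.random.rand(32, 8)

tmp = np.empty((8, 32))          # corresponds to an alloca of temporary storage
np.matmul(A, B, out=tmp)         # gemm: tmp <- A B
D = tmp @ C                      # gemm: D <- tmp C
```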
I would not support those as long as no one needs them. If there's an application in SeisSol / tandem or someone requests such features then we can add them but for now we should keep it simple and stupid.
Same as above: if anyone wants to work on mixed precision in SeisSol then we can think about it, otherwise no.
I'm sorry, maybe I'm too new to everything... But there are still more questions: how are the operations meant to be executed? I.e. strictly one after another, or may there also be some sort of tiling, or keeping data in registers (e.g. if we want to compute a Taylor series as done in SeisSol, i.e. a series of axpys onto the same target, then the target data could maybe be kept in registers)? How would we handle the temporary memory there? In particular (just to see if I finally got it now), for tiling, if ever necessary: would we just define sub-regions of tensors and go over them (resulting in less temporary memory than running over a whole matrix at once)?
They would be useful for fusing e.g. plasticity, dynamic rupture, or boundary computations; at least that was what I was thinking here. I.e. merge the matrix multiplication (modal->nodal) with the scalar operations happening afterwards (operations per node), then transform back (nodal->modal). That way we could mostly avoid having explicit SYCL kernels in the code.
That's why I was asking: SeisSol/SeisSol#932. It's not an immediate requirement, but just something to maybe keep in mind longer-term.
The operations are meant to be executed in the order that they are written. There is nothing special here. For temporary memory there is the alloca instruction, which needs to be generated by yateto. Tiling and vectorization choices are left to the IR compiler.
I guess you're thinking more about large tensor contractions here. The tensors we have in SeisSol should be small enough to fit in shared local memory / L2 cache.
Yeah that's possible but one should keep in mind whether it is more work to have everything in yateto or to just write a specialized CPU / SYCL kernel for Plasticity.
I think the largest difficulty is whether you have an explicit copy for floating point type conversion or whether you can fuse that with GEMM / axpy / etc. Currently, I don't think that any of the GEMM back-ends supports mixed precision.
But the compiler is surely allowed to reorder the operations at will?
I would also vote for for-loops with constant iteration count and simple non-linear functions. This would really simplify many kernels and would actually be quite useful.
Yes, of course, as long as it is legal to do so. The idea is that if, e.g., a "matmul" instruction is encountered then the compiler is going to parallelize that however it wants to, e.g. using a combination of vector instructions and possibly threads.
For non-linear functions I'd use an "apply" or an "apply_along_axis" instruction. That's easy to write and easy to vectorize. Loops might be allowed as well to implement loop-over-GEMM (instead of a LoG instruction), but I am not completely sure about that point yet.
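For illustration, the numpy analogues of the two instruction ideas (assuming "apply" is point-wise and "apply_along_axis" acts on 1-D fibers; the instruction names are still up for discussion):

```python
import numpy as np

A = np.random.rand(4, 5, 6)

# "apply": point-wise non-linear function on every entry
B = np.tanh(A)

# "apply_along_axis": a user-provided function applied to each 1-D fiber
def normalize(v):
    return v / np.linalg.norm(v)

C = np.apply_along_axis(normalize, 0, A)    # each fiber A[:, j, k]
```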
Signed-off-by: Carsten Uphoff <[email protected]>
I pushed a larger update to the spec. The largest open question addressed in this update is whether one should have dedicated instructions for loop-over-GEMM, index sum, and product, or loops over BLAS instructions. I went for dedicated instructions for the following reasons:
Therefore, I have introduced the log (loop-over-GEMM), product, sum, and axpby instructions. They all share the common feature that an index map has to be provided as an attribute to the instruction. E.g. the product C[ijk] <= A[ij] B[k] would be annotated with {ij, k to ijk} in the tensor IR. For loop-over-GEMM, the index map must be annotated to indicate looped and fused modes, e.g. {i[k]m, m(jn) to i(jn)[k]}. That way, we can select loop-over-GEMM in yateto. One question here is whether such annotations are mandatory for the compiler or whether they are only hints to the compiler? In any case, I think they must always be present such that the compiler does not need to implement the contraction-to-loop-over-GEMM mapping logic.

Further large changes are the introduction of arbitrarily high-order tensors, the subview instruction (replacing the submatrix instruction), and the introduction of hexadecimal floating point constants (as in C). Moreover, types need to be added in every instruction such that the compiler does not need any type inference logic. Short example:
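(Sketched here in numpy semantics rather than IR text, with made-up shapes; the looped/fused reading of the index maps follows the description above.)

```python
import numpy as np

ni, nj, nk, nm, nn = 4, 3, 5, 6, 2

# product C[ijk] <= A[ij] B[k] with index map {ij, k to ijk}
A = np.random.rand(ni, nj)
B = np.random.rand(nk)
C = np.einsum('ij,k->ijk', A, B)

# loop-over-GEMM with index map {i[k]m, m(jn) to i(jn)[k]}:
# [k] is the looped mode, (jn) are the fused modes, i.e. one GEMM per k
A2 = np.random.rand(ni, nk, nm)
B2 = np.random.rand(nm, nj * nn)         # modes j and n fused into one mode
D = np.empty((ni, nj * nn, nk))
for k in range(nk):
    D[:, :, k] = A2[:, k, :] @ B2        # GEMM inside the loop over k
```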
Looking forward to your input / comments!
I think we can all agree that the codegen part of YATeTo is in pretty bad shape. My list of open problems is the following:
To address some of these problems, I propose to create a mid-level tensor IR for codegen that is inspired by the LLVM IR. The idea is that YATeTo maps the output from controlflow to the tensor IR (or, even better, controlflow already operates on the tensor IR). The tensor IR can then be picked up by a back-end-specific tool that generates the low-level code for the target architecture. Here, the CPU back-end could be similar to the current design, where generic C code is mixed with calls to BLAS libraries, whereas the GPU back-end(s) likely need to generate both the glue code and the BLAS code.
The current draft added in this PR is by no means final; I just extracted a language definition from my current state, where many things are still missing.
Some general points I'd like to put up for discussion: