[proposal] Support of functional interfaces #516
Comments
Hi @andrii0lomakin, is this a proposal or an issue? You can build libraries on top of TornadoVM that provide functionality for specific application domains such as LLMs, Linear Algebra, Graphics, Physics, etc. |
This is a feature request. At least, that is how I filed it :-) At the moment, I have to repeat the same boilerplate code over and over, and being able to pass functional interfaces would help. |
OK, I am opening this for discussion among all community members and TornadoVM maintainers.
What do you mean by passing functional interfaces? At which level? As I mentioned, one can build libraries on top. For example, LLM and transformer library that contain |
Simplest example: I have an RMS norm layer that, in a nutshell, consists of reduce kernels that are called in layers as described here: https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf. Initially, I reduce the squared values and then just call the plain sum kernel in subsequent layers. The same is true for SoftMax, which uses a reduce function as its denominator. I suppose there are plenty of other such use cases. P.S. I am ready to implement this feature myself, but I need your feedback, of course. |
Reductions are supported in TornadoVM in two ways:
Note that the final reduction happens on the device. The TornadoVM JIT compiler generates two kernels from a single reduction kernel (one to run in parallel, and the second to perform the final reductions from the remaining work-groups).
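To make the two-kernel shape concrete, here is a minimal plain-Java sketch that emulates it on the CPU. It is not the code the TornadoVM JIT compiler generates; the class and method names (`TwoStageReductionSketch`, `partialSums`, `finalSum`) are assumptions made only for illustration.

```java
// Plain-Java emulation of a two-stage sum reduction.
// Assumption: this only mirrors the shape described above; it is not TornadoVM output.
final class TwoStageReductionSketch {

    // Stage 1: every "work-group" reduces its own slice into one partial sum.
    static float[] partialSums(float[] input, int groupSize) {
        int groups = (input.length + groupSize - 1) / groupSize;
        float[] partials = new float[groups];
        for (int g = 0; g < groups; g++) {
            int start = g * groupSize;
            int end = Math.min(input.length, start + groupSize);
            float acc = 0f;
            for (int i = start; i < end; i++) {
                acc += input[i];
            }
            partials[g] = acc;
        }
        return partials;
    }

    // Stage 2: fold the per-group partial sums into the final result.
    static float finalSum(float[] partials) {
        float acc = 0f;
        for (float p : partials) {
            acc += p;
        }
        return acc;
    }

    // Runs both stages back to back, as the two generated kernels would.
    static float reduce(float[] input, int groupSize) {
        return finalSum(partialSums(input, groupSize));
    }
}
```

On a device, stage 1 would run in parallel across work-groups, and stage 2 would consume the much smaller array of partial results.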
|
@jjfumero It seems I was not clear enough. I know how to implement a reduction and am familiar with barriers and so on. What I mean is that if I could pass the function to call during the reduction, I would not need to repeat the same boilerplate code again and again. For example, my code has two kernels. One does:
and another one
In principle, I could just pass a function to both kernels, but instead I have to repeat the same code again and again. P.S. In general, when I raise an issue, I have the impression that I always receive information on how to implement the basics instead of an in-depth discussion of the concrete issue. I probably need to change something in my conversational style to avoid this :-) |
As for TornadoVM annotations, I find them a good entry point for developers who want to learn about heterogeneous computing. However, because all those "complications" are done for nothing but performance, at this stage they are not quite suitable for production usage, at least in commercial applications or in libraries that are created to support commercial tools. |
TornadoVM is an academic project fully developed and maintained by Master and PhD students, researchers and staff at The University of Manchester. It is not a product, at least yet. Feedback and contributions are very welcome.
More concrete questions with test cases will be useful. If I sketch what you want (pseudocode):

    public static void sampleReduction(KernelContext context,
                                       FloatArray a,
                                       FloatArray b,
                                       FunctionalInterface f) {
        int globalIdx = context.globalIdx;
        int localIdx = context.localIdx;
        int localGroupSize = context.localGroupSizeX;
        int groupID = context.groupIdx; // Expose Group ID
        float[] localA = context.allocateFloatLocalArray(256);
        localA[localIdx] = a.get(globalIdx);
        for (int stride = (localGroupSize / 2); stride > 0; stride /= 2) {
            context.localBarrier();
            if (localIdx < stride) {
                localA[localIdx] = f.apply(localA[localIdx], localA[localIdx + stride]); // Use of a functional interface
            }
        }
        if (localIdx == 0) {
            b.set(groupID, localA[0]);
        }
    }

The functional interface could potentially be used at any level and in different scenarios, not just for the compute. Is this a better approximation? |
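For readers who want to run the idea today, the strided loop from the pseudocode can be emulated sequentially in plain Java. This is only a sketch: `FloatBinaryOp` and `GroupReductionSketch` are hypothetical names, no TornadoVM API is involved, and the group length is assumed to be a non-zero power of two.

```java
// Hypothetical two-argument float operation; the name is an assumption for this sketch.
@FunctionalInterface
interface FloatBinaryOp {
    float apply(float left, float right);
}

final class GroupReductionSketch {

    // Sequentially emulates the strided tree reduction of one work-group.
    // Assumes group.length is a non-zero power of two, like the local array in the pseudocode.
    static float reduceGroup(float[] group, FloatBinaryOp f) {
        float[] local = group.clone(); // stands in for the local-memory array
        for (int stride = local.length / 2; stride > 0; stride /= 2) {
            // On a device, a local barrier would separate these rounds.
            for (int localIdx = 0; localIdx < stride; localIdx++) {
                local[localIdx] = f.apply(local[localIdx], local[localIdx + stride]);
            }
        }
        return local[0];
    }
}
```

The same skeleton then serves different reductions, for example `GroupReductionSketch.reduceGroup(values, (x, y) -> x + y)` for a sum and `GroupReductionSketch.reduceGroup(values, (x, y) -> Math.max(x, y))` for a maximum.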
First of all, I truly believe that TornadoVM has great potential, and I am doing a lot of advertising for this project, hopefully with some positive outcome. If all goes well, I hope there will be a lot of contributions from my side. Do not get me wrong: my last remark was intended only to improve the quality of the discussion and nothing more.
Absolutely, thank you for the summary. |
As I mentioned above, I am ready to implement this feature myself, but I am not sure that it fits the project design. |
Comments and feedback are very valuable to us, so we really appreciate yours. Hopefully, with the help of community members like you, TornadoVM can improve in many aspects. This feature looks like a great addition. If you want to implement these cases, feel free to open a PR. |
@jjfumero Cool, I am on it then. I will provide a PR in a few weeks. |
@jjfumero, here is a sketch of the steps that I will follow to implement the given issue. First of all, only The algorithm will be similar to already implemented in
As a result of the last step, the bytecode of a new class will be generated, and the class will be defined using the P.S. With such an approach, I am thinking about loosening the requirements on the lambda code passed to the task. P.S.2 In later versions, I plan to validate the passed-in tasks using ASM so that more meaningful exceptions are thrown; at present, cryptic errors are thrown in many cases, leaving users puzzled about what is really going on. |
IMHO, the GraalVM JIT can be successfully replaced by the upcoming HAT project, but it seems it will take a while before the first version is provided by the OpenJDK team. |
Follow up questions:
Which requirements are you referring to?
If we have the reference to the method already, why do we need to generate host code to be able to compile with TornadoVM? I did not get this part. |
Hi @jjfumero
No. The API stays the same.
I cannot provide a test case now, but I will provide a conceptual usage:
As I can see now, only
No. But I will probably need to add support for handling primitive wrappers, we will see. |
AFAIK, most backends do not support polymorphic calls, so passing a lambda essentially means implicitly generating a new kernel with the passed function. Kotlin works exactly the same way: it inlines passed-in lambdas and generates artificial functions to decrease object allocations. |
I do not think this requirement is actually needed. I will check during the concrete implementation. |
Let us skip it for the moment. In my experience, the only robust way to pass kernels as of now is to pass static methods. For example, this code fails
The same code in a static method works like a charm, though I need to investigate the reasons more deeply. It seems to boil down to the handling of Unbox(ing), which I suppose I will need to deal with anyway during implementation. In general, I want to do the following with primitive wrappers: allow only the following operations on wrappers in passed-in code:
Those checks will be performed while resolving method handles and will essentially allow wrappers to be replaced by primitives. I will be ready to discuss this in more detail once I am further along in the implementation. |
@jjfumero Did I answer your questions? |
Yes. I think we have different views on implementation, which is normal. If you want to work on this, I suggest reviewing the proposal when you have a PoC, and we can iterate on this. I think what you will see is a new parameter in the Graal IR that corresponds to your new lambda function. Then, from my view, you can use Graal to get access to that lambda. To me, the bulk of the work is in the code gen. But again, it could be just another way to implement the same functionality. |
Sounds like a plan, thank you. |
Many tasks in heterogeneous computing are variations of the same well-known patterns that differ only in the functions called per element.
Take reduce, for example: SoftMax and RMS normalization, in a nutshell, differ only in the function applied.
It would be beneficial to support functional interfaces as parameters of kernels that unwind to real calls (with appropriate restrictions, of course).
That will minimize development time and increase the maintainability of kernels developed in TornadoVM.
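As an illustration of the pattern, here is a minimal plain-Java sketch in which one reduction skeleton is shared and only the per-element function changes. The names `FloatUnaryOp` and `MapReduceSketch` are hypothetical, the code runs on the CPU without any TornadoVM API, and the RMS helper omits the small epsilon term that practical RMS normalization usually adds.

```java
// Hypothetical per-element float function; the name is an assumption for this sketch.
@FunctionalInterface
interface FloatUnaryOp {
    float apply(float value);
}

final class MapReduceSketch {

    // One shared reduction skeleton: map each element, then sum the results.
    static float mapSum(float[] values, FloatUnaryOp map) {
        float acc = 0f;
        for (float v : values) {
            acc += map.apply(v);
        }
        return acc;
    }

    // RMS normalization needs the sum of squares (epsilon omitted for brevity).
    static float rms(float[] values) {
        float sumOfSquares = mapSum(values, v -> v * v);
        return (float) Math.sqrt(sumOfSquares / values.length);
    }

    // SoftMax needs the sum of exponentials as its denominator.
    static float softMaxDenominator(float[] values) {
        return mapSum(values, v -> (float) Math.exp(v));
    }
}
```

Both helpers reuse `mapSum` unchanged, which is the boilerplate saving the request is after.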