Multihead attention #199
Conversation
Hello Michael, I saw your pull requests and I think what you're doing is very interesting. Could you take a look at mine? milancurcic#2 What you need to look at here is not the locally connected 1d layer but rather the reshape architecture I'm trying to make. Your help would be very much appreciated. Thanks
@OneAdder I see you are talking about 2D handling. Would you like to work on this together? I have to make this as well since I'm implementing a conv1d layer.
Hi guys, thanks for pushing this forward. Today I'm finishing a lot of busywork with some proposals, so next week I'll get back to neural-fortran work and will be able to contribute to the reviews more actively.
@ricor07 Great idea! But we have a problem with generics again. The issue is that a
Yes, I think we can make a generic predict. But I suggest you create a new branch.
I think it's fine to make
@milancurcic Done, here: #198
You can make it. I'll work on maxpool.
Hi Michael, just in case you weren't aware, Ondrej has a procedural implementation of GPT-2, including an MHA subroutine, here: https://github.com/certik/fastGPT/blob/main/gpt2.f90. You can use it as a reference Fortran implementation if you need one. When you're ready for me to start playing with this PR, just mark it Ready for review.
@milancurcic The attention backward and forward passes were both done and tested about a week ago. I mostly used the Attention Is All You Need paper as a reference, as it is pretty straightforward and includes most of the formulae. Besides, that example is not particularly useful here as it doesn't implement backprop. But we can reuse the piece that loads weights from HuggingFace at a later point.
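For readers following along, these are the core formulae from that paper (standard notation from the article, not identifiers from this PR's code):

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V
$$

$$
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O,
\qquad \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)
$$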
Except for one thing: redundant arguments. I'll fix that.
So, my example works. The full checklist:
@milancurcic I have this text classification example which can also be used as an example for MHA. But it's large and doesn't make much sense until I add Positional Encoding (WIP). Should I add it here anyway, or should I add it at a later point in a separate PR? I can come up with a small toy example here.
Excellent, thanks, I'll begin reviewing tomorrow morning. How large is it? In the past, I'd upload data tarballs as attachments in a GitHub issue and use their permanent URLs to download from programs. That way the data storage doesn't go into git history. But there's some size limit. Our MNIST data used here is stored like that.
For an example of datasets stored this way, see neural-fortran/src/nf/nf_datasets.f90, lines 18 to 29 in c316ee1.
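A rough sketch of that pattern (everything below is illustrative; the real attachment URLs live in nf_datasets.f90, and <attachment-id> is a placeholder, not a real ID):

```fortran
! Illustrative only: dataset tarballs are uploaded as GitHub issue
! attachments and their permanent URLs are kept as parameters, so the
! data itself never enters git history.
module nf_datasets_sketch
  implicit none
  character(*), parameter :: base_url = &
    'https://github.com/modern-fortran/neural-fortran/files'
  character(*), parameter :: mnist_url = &
    base_url // '/<attachment-id>/mnist.tar.gz'
end module nf_datasets_sketch
```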
…ue to IEEE_DENORMAL)
@milancurcic I think it's ready for review! I'll add the more complicated example later. At this stage I added a simple example that converges nicely and doesn't require datasets or extra deps.
implicit none

type, extends(multihead_attention_layer) :: cross_attention_layer
It is intentional that there is no plumbing for this one yet. I suggest we add it at a later stage when we have more components for seq2seq models. At this stage it can be added like this, without any public access.
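A minimal sketch of what that could look like, assuming the parent type from this PR; the module names here are assumptions, not the final design:

```fortran
! Illustrative only: declare the extended type but do not export it, so it
! cannot be reached through the public API until seq2seq plumbing exists.
module nf_cross_attention_sketch
  use nf_multihead_attention, only: multihead_attention_layer
  implicit none
  private  ! no public entities yet

  type, extends(multihead_attention_layer) :: cross_attention_layer
    ! later: a forward pass that takes separate query and key/value sequences
  end type cross_attention_layer
end module nf_cross_attention_sketch
```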
logical :: res

res = all(abs(x - y) <= (1e-06 + 1e-05 * abs(y)))
end function allclose
Suggestion for future: create nf_utils.f90 (or similar) and put this procedure there.
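A sketch of that suggestion, using the same tolerances as the snippet above (the module layout is just a guess):

```fortran
! A possible nf_utils.f90: shared utility helpers collected in one module.
module nf_utils
  implicit none
  private
  public :: allclose

contains

  pure function allclose(x, y) result(res)
    ! True when every element satisfies |x - y| <= atol + rtol * |y|,
    ! matching the comparison used in the tests above.
    real, intent(in) :: x(:), y(:)
    logical :: res
    res = all(abs(x - y) <= (1e-06 + 1e-05 * abs(y)))
  end function allclose

end module nf_utils
```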
end do
end subroutine create_attention_matrix

pure module subroutine normalize_attention_matrix(self, attention_mask)
attention_mask is not accessible to users by design at this point. It will be used by the transformer decoder later, and I'll add the corresponding logic then.
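For context, a standalone sketch of what softmax normalization with an optional additive mask typically does; this is illustrative, not the implementation in this PR:

```fortran
! Illustrative only: raw attention scores are shifted by an additive mask
! (large negative values block positions) and then normalized row-wise.
pure subroutine softmax_rows(scores, attention_mask)
  real, intent(inout) :: scores(:, :)
  real, intent(in), optional :: attention_mask(:, :)
  integer :: i
  if (present(attention_mask)) scores = scores + attention_mask
  do i = 1, size(scores, dim=1)
    ! subtract the row maximum for numerical stability, then normalize
    scores(i, :) = exp(scores(i, :) - maxval(scores(i, :)))
    scores(i, :) = scores(i, :) / sum(scores(i, :))
  end do
end subroutine softmax_rows
```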
Hello, Milan! I hope I'm not bothering you too much with my pull requests, but this is a good one. At this stage it is a draft of MultiHead Attention. It cannot be merged until work on input2d_layer and linear2d_layer is completed. Implementation of dropout would also help improve MHA, but it can be added later.
MultiHead Attention
MultiHead Attention is the main component of the Transformer architecture, which is the most advanced modern approach in Natural Language Processing, as well as some other areas.
Here I propose an implementation based on the Transformer article. It works, and its output conforms with the SOTA implementation in PyTorch.
Problems
MultiHead Attention takes query, key and value separately, which will not work in the current paradigm. What can be done: implement it solely as Self Attention -- the input will be only one, and it will be copied three times. However, this approach will not work for Cross Attention. For that, the layer would have to be implemented as a 3D layer, while Self Attention will be 2D.
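To make that first point concrete, a toy sketch of the self-attention special case (names and shapes here are illustrative, not this PR's API):

```fortran
! Illustrative only: in self attention a single 2D input
! (sequence length x model dimension) is reused as query, key and value.
! Cross attention would need a second sequence for key/value, hence the
! extra plumbing it requires.
program self_attention_inputs
  implicit none
  real :: input(8, 64)
  real :: query(8, 64), key(8, 64), value(8, 64)

  call random_number(input)

  ! the single input is copied three times
  query = input
  key   = input
  value = input
end program self_attention_inputs
```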
When that prerequisite work is merged, this branch will be based on main again, so it will not be an issue. At this stage, only these two files are to be reviewed: nf_multihead_attention.f90 and test_multihead_attention_layer.f90.
Python Reference
Here is the snippet of code that uses PyTorch to calculate MultiHead Attention:
Output:
It is the same as in my tests.