I have tried using MaxText to train Llama 3.2 3B. This works fine with only minor modifications to the configs.
However, I am unable to train Llama 3.2 1B. The reason is that Flash/Splash attention seems to require that head_dim be divisible by 128, while the head_dim of the 1B model is only 64, so I get a "not implemented" error. Falling back to dot_product attention is not practical at long context lengths because it materializes the full attention matrix.
Any ideas?
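
One workaround I have been considering is zero-padding head_dim from 64 up to 128 before the kernel call and slicing the output back afterwards. Zero columns in K leave the attention logits unchanged and zero columns in V only add zero output channels, so the result is exact as long as the softmax scale stays at 1/sqrt(64). Below is a minimal sketch of the idea, not MaxText's actual API: `attn_fn` is a placeholder for the splash kernel and I am assuming the kernel accepts an explicit softmax scale.

```python
import jax
import jax.numpy as jnp


def attention_with_padded_head_dim(q, k, v, attn_fn, target_head_dim=128):
    """Zero-pad head_dim so a kernel requiring head_dim % 128 == 0 can run.

    q, k, v: [batch, heads, seq, head_dim] (head_dim = 64 for Llama 3.2 1B).
    attn_fn: attention callable taking (q, k, v, scale) -- a stand-in for the
             flash/splash kernel (hypothetical signature).
    """
    head_dim = q.shape[-1]
    pad = [(0, 0)] * (q.ndim - 1) + [(0, target_head_dim - head_dim)]
    q_p, k_p, v_p = (jnp.pad(x, pad) for x in (q, k, v))
    # Keep the softmax scale based on the *original* head_dim (1/sqrt(64)),
    # otherwise the padded dimension changes the logits.
    out = attn_fn(q_p, k_p, v_p, scale=head_dim ** -0.5)
    # The padded output channels are zero, so slicing restores the result.
    return out[..., :head_dim]


def _reference_attention(q, k, v, scale):
    # Plain dot-product attention, used here only to sanity-check the trick.
    logits = jnp.einsum("bhqd,bhkd->bhqk", q, k) * scale
    return jnp.einsum("bhqk,bhkd->bhqd", jax.nn.softmax(logits, axis=-1), v)


if __name__ == "__main__":
    kq, kk, kv = jax.random.split(jax.random.PRNGKey(0), 3)
    q = jax.random.normal(kq, (1, 32, 16, 64))  # 1B-like: 32 heads, head_dim 64
    k = jax.random.normal(kk, (1, 32, 16, 64))
    v = jax.random.normal(kv, (1, 32, 16, 64))
    out = attention_with_padded_head_dim(q, k, v, _reference_attention)
    ref = _reference_attention(q, k, v, scale=64 ** -0.5)
    assert jnp.allclose(out, ref, atol=1e-5)
```

The obvious downside is that the kernel then does twice the work and holds twice the KV data per head, so I am not sure it is worth it versus proper head_dim=64 support in the kernel.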