I have tried using MaxText to train Llama 3.2 3B. This works fine with only minor modifications to the configs.
However, I am unable to train Llama 3.2 1B. The reason is that Flash/Splash attention seems to require that head_dim be divisible by 128, while the head_dim of the 1B model is only 64, so I get a "not implemented" error. Falling back to dot_product attention is not practical at long context lengths because it materializes the full attention matrix.
Any ideas?
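
One workaround I have been considering is zero-padding head_dim from 64 up to 128 before the kernel call and slicing the output back afterwards. Zero columns in K leave the attention logits unchanged and zero columns in V only add zero output channels, so the result is exact as long as the softmax scale stays at 1/sqrt(64). Below is a minimal sketch of the idea, not MaxText's actual API: `attn_fn` is a placeholder for the splash kernel and I am assuming the kernel accepts an explicit softmax scale.

```python
import jax
import jax.numpy as jnp


def attention_with_padded_head_dim(q, k, v, attn_fn, target_head_dim=128):
    """Zero-pad head_dim so a kernel requiring head_dim % 128 == 0 can run.

    q, k, v: [batch, heads, seq, head_dim] (head_dim = 64 for Llama 3.2 1B).
    attn_fn: attention callable taking (q, k, v, scale) -- a stand-in for the
             flash/splash kernel (hypothetical signature).
    """
    head_dim = q.shape[-1]
    pad = [(0, 0)] * (q.ndim - 1) + [(0, target_head_dim - head_dim)]
    q_p, k_p, v_p = (jnp.pad(x, pad) for x in (q, k, v))
    # Keep the softmax scale based on the *original* head_dim (1/sqrt(64)),
    # otherwise the padded dimension changes the logits.
    out = attn_fn(q_p, k_p, v_p, scale=head_dim ** -0.5)
    # The padded output channels are zero, so slicing restores the result.
    return out[..., :head_dim]


def _reference_attention(q, k, v, scale):
    # Plain dot-product attention, used here only to sanity-check the trick.
    logits = jnp.einsum("bhqd,bhkd->bhqk", q, k) * scale
    return jnp.einsum("bhqk,bhkd->bhqd", jax.nn.softmax(logits, axis=-1), v)


if __name__ == "__main__":
    kq, kk, kv = jax.random.split(jax.random.PRNGKey(0), 3)
    q = jax.random.normal(kq, (1, 32, 16, 64))  # 1B-like: 32 heads, head_dim 64
    k = jax.random.normal(kk, (1, 32, 16, 64))
    v = jax.random.normal(kv, (1, 32, 16, 64))
    out = attention_with_padded_head_dim(q, k, v, _reference_attention)
    ref = _reference_attention(q, k, v, scale=64 ** -0.5)
    assert jnp.allclose(out, ref, atol=1e-5)
```

The obvious downside is that the kernel then does twice the work and holds twice the KV data per head, so I am not sure it is worth it versus proper head_dim=64 support in the kernel.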