🚀 The feature, motivation and pitch

This feature would allow developers to fine-tune on smaller GPUs and/or with larger batch sizes, likely leading to higher MFU.

Currently, activation checkpointing only works with FSDP, not on a single GPU.

Perhaps this is not a worthwhile feature, because realistically no one is going to fine-tune on a single GPU anyway, and as a workaround you can simply enable FSDP on a single GPU to get activation checkpointing. I verified that this workaround works: nvidia-smi now reports that the Python process uses about 45 GB of memory instead of 77 GB without FSDP, and my time per batch increased from 1.09 to 1.65.

After bumping the batch size from 11 to 19 to take advantage of the freed memory, my TPS went from 20_859 to 23_583 and MFU from 50% to 56%.
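To enable FSDP on a single GPU, I launched llama_recipes.finetuning.main through torchrun using this minimal wrapper script: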
"""This is a minimal wrapper so that torchrun has a physical py file to access."""importfireimportllama_recipes.finetuningif__name__=="__main__":
fire.Fire(llama_recipes.finetuning.main)
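For what it's worth, the single-GPU version of this feature would not need any FSDP machinery: PyTorch's checkpoint-wrapper utilities work on a plain nn.Module. Below is a minimal sketch of what it could look like, assuming a recent PyTorch, an example Hugging Face Llama checkpoint, and LlamaDecoderLayer as the wrapping granularity (mirroring what the FSDP path checkpoints). This is a sketch under those assumptions, not the actual llama-recipes implementation.

```python
# Sketch only: single-GPU activation checkpointing without FSDP, using PyTorch's
# checkpoint-wrapper utilities. The checkpoint name and the choice of
# LlamaDecoderLayer as the wrapping granularity are assumptions for illustration.
from functools import partial

from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    CheckpointImpl,
    apply_activation_checkpointing,
    checkpoint_wrapper,
)
from transformers import LlamaForCausalLM
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf").cuda()

# Wrap every decoder layer with a non-reentrant checkpoint wrapper so its
# activations are recomputed during the backward pass instead of being stored.
non_reentrant_wrapper = partial(
    checkpoint_wrapper,
    checkpoint_impl=CheckpointImpl.NO_REENTRANT,
)
apply_activation_checkpointing(
    model,
    checkpoint_wrapper_fn=non_reentrant_wrapper,
    check_fn=lambda submodule: isinstance(submodule, LlamaDecoderLayer),
)

# The model can now be trained as usual on a single GPU with a smaller
# activation-memory footprint.
```

Since apply_activation_checkpointing does not touch a process group, it could presumably be called from the non-FSDP branch of the training script as well.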
Alternatives
No response
Additional context
No response