
Multi-GPU Support: Originally limited to single-GPU setups, this code… #48

Open
wants to merge 2 commits into
base: main

Conversation

StevenChen16

  1. Multi-GPU Support: Originally limited to single-GPU setups, this code now leverages the Accelerate library with init_empty_weights and infer_auto_device_map for multi-GPU deployment, maximizing memory utilization across available GPUs.

  2. Efficient Weight Management: By using load_checkpoint_and_dispatch, model weights are dynamically allocated across GPUs and offloaded to disk as needed, enhancing memory efficiency for larger models (a sketch of this loading path follows this list).

  3. Nested Event Loop Support: The addition of nest_asyncio enables nested event loops, improving compatibility when running FastAPI within Jupyter or similar environments (a second sketch below illustrates this).

  4. Code Simplification: Streamlined model and tokenizer loading eliminates manual device allocation, making the code more readable and efficient.
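
A minimal sketch of the loading path described in points 1 and 2, assuming a local GLM-4-style checkpoint directory and assuming GLMBlock is the transformer-block class that should not be split across devices (both names are placeholders, not taken from this PR):

```python
import torch
from accelerate import init_empty_weights, infer_auto_device_map, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "./glm-4-9b-chat"  # hypothetical local checkpoint directory

config = AutoConfig.from_pretrained(MODEL_PATH, trust_remote_code=True)

# Build the model skeleton on the meta device, without allocating real weight tensors.
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)

# Compute a per-module placement across the visible GPUs (spilling to CPU/disk if needed).
device_map = infer_auto_device_map(
    model,
    no_split_module_classes=["GLMBlock"],  # assumption: keep each transformer block on one device
    dtype=torch.bfloat16,
)

# Load the checkpoint shards, dispatch each weight to its assigned device,
# and offload anything that does not fit to disk.
model = load_checkpoint_and_dispatch(
    model,
    checkpoint=MODEL_PATH,
    device_map=device_map,
    offload_folder="offload",
    dtype=torch.bfloat16,
).eval()

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
```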

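And a minimal sketch of point 3, showing how nest_asyncio lets uvicorn.run() start inside an environment that already has a running event loop (the /health route is just a placeholder):

```python
import nest_asyncio
import uvicorn
from fastapi import FastAPI

# Patch asyncio so a new event loop can run inside an already-running one
# (e.g. when launching the FastAPI server from a Jupyter notebook).
nest_asyncio.apply()

app = FastAPI()

@app.get("/health")
def health():
    return {"status": "ok"}

uvicorn.run(app, host="0.0.0.0", port=8000)
```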
@zRzRzRzRzRzRzR
Member

Why not directly use the auto solution provided by transformers? It can automatically allocate the model across different GPUs when a single GPU does not have enough memory.
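
For reference, the transformers-only path referred to here would look roughly like the following sketch (the model id is a placeholder):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "THUDM/glm-4-9b-chat"  # placeholder model id

# device_map="auto" asks transformers/accelerate to split the model
# across all visible GPUs (and CPU) automatically.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype="auto",
).eval()
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
```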

@StevenChen16
Author

Why not directly use the auto solution provided by transformers? It can automatically allocate the model across different GPUs when a single GPU does not have enough memory.

I chose a custom approach over the transformers auto allocation because it offers finer control over GPU memory management. Specifically, by using Accelerate with init_empty_weights and infer_auto_device_map, I can define exact memory constraints per GPU, ensuring stable distribution even when memory is limited or varies across devices. This method also leverages offloading to disk for parts of the model that exceed GPU capacity, reducing the risk of memory issues during runtime.
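
A minimal sketch of what that finer control looks like, assuming two GPUs with placeholder memory budgets (the GLMBlock class name is the same assumption as in the loading sketch above):

```python
import torch
from accelerate import init_empty_weights, infer_auto_device_map
from transformers import AutoConfig, AutoModelForCausalLM

MODEL_PATH = "./glm-4-9b-chat"  # hypothetical local checkpoint directory

config = AutoConfig.from_pretrained(MODEL_PATH, trust_remote_code=True)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)

# Explicit per-device budgets: cap each GPU below its physical capacity and
# allow CPU RAM as a spillover target (values here are placeholders).
max_memory = {0: "20GiB", 1: "20GiB", "cpu": "48GiB"}

device_map = infer_auto_device_map(
    model,
    max_memory=max_memory,
    no_split_module_classes=["GLMBlock"],  # assumption: keep each transformer block whole
    dtype=torch.bfloat16,
)
# Anything that still does not fit is mapped to "cpu" or "disk" and is then
# offloaded by load_checkpoint_and_dispatch(..., offload_folder=...).
```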
