
The model is loaded repeatedly #692

Open
belog2867 opened this issue Feb 9, 2025 · 5 comments

Comments

@belog2867

The model was loaded twice, and the 1B Llama model took up more than 10 GB of RAM. Is this normal?

Attachment: 新建 文本文档.txt ("New Text Document.txt")

@belog2867
Author

Linux, tinygrad engine

@yetisno

yetisno commented Feb 13, 2025

Same problem!

llama-3-8b and llama-3.2-1b both load the model into GPU memory twice!

On AWS g6 and g4dn generation instances with the Deep Learning OSS Nvidia Driver AMI (GPU, PyTorch 2.5.1, Ubuntu 22.04, 20250202).

@AlexCheema
Contributor

These look like two different models.
The first one should be unloaded from memory when the second one is loaded in.

@yetisno

yetisno commented Feb 14, 2025

In my experiments, it happened with every model I tried on a Linux instance.

Cluster with only one node:

The node loads the same model twice when it receives a request for the first time.

Cluster with multiple nodes:

The node that receives the API request loads the partition it should process twice the first time.

The other nodes load it only once, not twice.

"load model" --> load the model into GPU memory.

"load same model twice" --> load the same model into GPU memory twice, occupying 2x the memory.

You can see the log when you scroll up in the TUI.

Just like in the attachment @belog2867 posted, it only happens on the node that receives the API request.

@MostHated

MostHated commented Feb 16, 2025

I just started seeing this today as well (on a Jetson Orin Nano), no idea why. Trying to load the Llama 3.2 1B model took up all the memory (8 GB, with about 7.4 GB available) and locked up the device each time I attempted it. After the first lockup, I watched the loading output and kept seeing it load all the way through, then just start over again.

I noticed the model it was trying to use started with unsloth/restOfModelName. I can't say I remember it being an unsloth model before, but that doesn't mean it wasn't. I just don't remember it.

4 participants