rm_alloc returned 81: Out of memory #658

Open
kensmith opened this issue Jan 30, 2025 · 7 comments
Comments

@kensmith

exo seems to be OOMing despite having lots of free RAM.

% cat /proc/meminfo|head
MemTotal:       16029428 kB
MemFree:         1478608 kB
MemAvailable:   10660452 kB
Buffers:          560476 kB
Cached:         10730028 kB
SwapCached:          820 kB
Active:          5688328 kB
Inactive:        7405268 kB
Active(anon):    2852260 kB
Inactive(anon):  1600664 kB

(screenshot attached)

% curl http://localhost:52415/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
     "model": "llama-3.1-8b",
     "messages": [{"role": "user", "content": "What is the meaning of exo?"}],
     "temperature": 0.7
   }'
{"detail": "Error processing prompt (see logs with DEBUG>=2): rm_alloc returned 81: Out of memory"}%

If I read the README.md correctly, I should only need 16GB of RAM to run this model so having more than 100GB free seems like I should not be OOMing.
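
The error message itself points at more detailed logging. A minimal way to capture it, assuming DEBUG is read as an environment variable when exo is launched (which is what the message implies; the exact mechanism may differ):

% DEBUG=2 exo
# then, from a second terminal, replay the failing request and watch
# exo's output for the allocation that fails with rm_alloc 81:
% curl http://localhost:52415/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.1-8b", "messages": [{"role": "user", "content": "hi"}]}'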

@AlexCheema
Contributor

What device are you running on?

@kensmith
Author

Arch Linux
AMD Ryzen 7 9700X
RTX3080TI

@andrenaP

I think I have the same issue.
I am trying to run Llama 3.2 1B on Arch Linux with 16 GB of RAM and an Intel CPU, and I got the same Out of memory error.
BUT I was able to run it in a cluster of 7 GB (other laptop) + 4 GB (GPU in Docker) + 15 GB (main laptop RAM) at a speed of 0.3 tokens/sec.
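
For reference, a rough sketch of how such a cluster is usually brought up, assuming exo's automatic peer discovery on a shared local network (as described in the project README); only the plain exo launcher is assumed, no extra flags:

% exo        # on the main laptop (15 GB RAM, CPU only)
% exo        # on the other laptop (7 GB RAM)
% exo        # inside the Docker container that owns the 4 GB GPU
# the nodes should discover each other and partition the model across
# their combined memory; the container likely needs host networking
# (e.g. docker run --network host) for discovery to work.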

@ejrydhfs

ejrydhfs commented Feb 8, 2025

Sounds like exo is trying to use only VRAM and do inference on the GPU, instead of hybrid inference or falling back to CPU-only inference. I ran into the same issue today.
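
If that is what is happening, one thing to try is steering the inference engine off the GPU. This is only a sketch, assuming exo's tinygrad backend (which is where rm_alloc errors originate on NVIDIA) honours tinygrad's standard device-selection environment variables; whether exo passes them through is an assumption, not a documented option:

% CLANG=1 exo     # ask tinygrad for its CPU (clang) backend instead of the NVIDIA one
# if this runs, the original failure was GPU memory, not system RAM.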

@ejrydhfs

ejrydhfs commented Feb 8, 2025

> I think I have the same issue. I am trying to run Llama 3.2 1B on Arch Linux with 16 GB of RAM and an Intel CPU, and I got the same Out of memory error. BUT I was able to run it in a cluster of 7 GB (other laptop) + 4 GB (GPU in Docker) + 15 GB (main laptop RAM) at a speed of 0.3 tokens/sec.

Mind saying how you were able to do it? Did you isolate the GPU from the rest of the system, making the system look like it only had an integrated GPU? I assume OOM may not be an issue on systems with integrated graphics/unified memory, because the GPU and CPU share the same memory.

@andrenaP

andrenaP commented Feb 8, 2025

> I think I have the same issue. I am trying to run Llama 3.2 1B on Arch Linux with 16 GB of RAM and an Intel CPU, and I got the same Out of memory error. BUT I was able to run it in a cluster of 7 GB (other laptop) + 4 GB (GPU in Docker) + 15 GB (main laptop RAM) at a speed of 0.3 tokens/sec.

> Mind saying how you were able to do it? Did you isolate the GPU from the rest of the system, making the system look like it only had an integrated GPU? I assume OOM may not be an issue on systems with integrated graphics/unified memory, because the GPU and CPU share the same memory.

This was pretty easy. I created a Docker container with exo. Because I didn't pass the GPU into the container, it used the CPU. (attached: IMG_20250208_122918_994.jpg)
Of course, I had to download the same model in both the container and on the main system... In the future I will use Docker volumes to share the model between Docker and the main system.
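
A rough sketch of the volume idea, assuming a locally built image named exo-cpu and a model cache under ~/.cache/exo (both the image name and the cache path are placeholders; adjust them to wherever your install actually stores models):

% docker build -t exo-cpu .
% docker run --rm -it \
    --network host \
    -v ~/.cache/exo:/root/.cache/exo \
    exo-cpu
# no --gpus flag, so only the CPU is visible inside the container; the
# bind mount lets the container and the host share one copy of the weights.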

@ejrydhfs

ejrydhfs commented Feb 10, 2025

I thought about implementing a Docker workaround too. It's reliable and effective, but I believe it is not ideal, since it implies that a network is also emulated in the system for communication between the CPU and the GPU, and that imposes an overhead which can be significant at high speeds like 10 Gbps. An ideal solution, I believe, would be to support hybrid inference on individual nodes.
