Load 70b model only once -- for embedding and for completion #592
ChristophJud
Hi
I'm working on a retrieval-augmented chatbot that has to perform both completion and embedding, switching back and forth between the two. The problem is that the model has to be loaded either for embedding or for completion, so to do both it has to be held twice in GPU memory. On my RTX A6000 this is fine for the 13b model, but the 70b model fits into memory only once.
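To put rough numbers on the memory constraint, here is a back-of-the-envelope sketch. It assumes the 48 GB of VRAM on an RTX A6000 and a hypothetical quantization of about 4.5 bits per weight (the actual quantization is not stated above), and it counts only the weights, ignoring the KV cache and runtime overhead. The helper names are made up for illustration.

```python
# Weights-only VRAM estimate. Assumptions (not from the original post):
#   - ~4.5 bits per weight, a hypothetical 4-bit-style quantization
#   - KV cache, activations, and runtime overhead are ignored
GPU_VRAM_GB = 48.0  # RTX A6000


def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def copies_that_fit(params_billion: float, bits_per_weight: float,
                    vram_gb: float = GPU_VRAM_GB) -> int:
    """How many full copies of the weights fit into the given VRAM."""
    return int(vram_gb // weights_gb(params_billion, bits_per_weight))


if __name__ == "__main__":
    for size in (13, 70):
        gb = weights_gb(size, 4.5)
        n = copies_that_fit(size, 4.5)
        print(f"{size}b: ~{gb:.1f} GB per copy, {n} copy/copies fit in {GPU_VRAM_GB:.0f} GB")
```

Under these assumptions, the 13b model leaves room for a second copy, while two copies of the 70b model would need well over 48 GB. That is why holding separate embedding and completion instances stops working at 70b.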
Is there a reason or a fundamental principle why embeddings cannot be created if the model was loaded without the embedding flag? It would be handy to have a hybrid mode in which the entire model is loaded once and can then perform both operations.
I'm curious what you think about this.
Best
Christoph