Load 70b model only once -- for embedding and for completion #592

ChristophJud · 2023-08-09T13:03:31Z


Hi,
I'm working on a retrieval-augmented chatbot that has to perform both completion and embedding, switching back and forth between the two. The problem is that the model has to be loaded either for embedding or for completion, so I have to hold it in GPU memory twice. On my RTX A6000 this is fine for the 13b model, but the 70b model fits into memory only once.
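
A back-of-the-envelope sketch of the memory arithmetic, for illustration only. It assumes the 48 GB of VRAM on an RTX A6000 and a quantized model at roughly 4.5 bits per weight; the quantization level is not stated in this post and is purely an assumption, as are the helper names `weights_gib` and `copies_that_fit`. Only the weights are counted, so the KV cache and activations would reduce the headroom further.

```python
# Rough estimate of how many copies of a model's weights fit in GPU memory.
# Assumptions (not from the post): ~4.5 bits per weight, weights only.

GIB = 1024 ** 3
VRAM_GIB = 48.0          # RTX A6000
BITS_PER_WEIGHT = 4.5    # hypothetical quantization level


def weights_gib(params_billion: float, bits_per_weight: float = BITS_PER_WEIGHT) -> float:
    """Approximate size of the weights in GiB."""
    return params_billion * 1e9 * bits_per_weight / 8 / GIB


def copies_that_fit(params_billion: float,
                    bits_per_weight: float = BITS_PER_WEIGHT,
                    vram_gib: float = VRAM_GIB) -> int:
    """Number of full weight copies that fit in the given VRAM."""
    return int(vram_gib // weights_gib(params_billion, bits_per_weight))


for size in (13, 70):
    print(f"{size}b: ~{weights_gib(size):.1f} GiB per copy, "
          f"{copies_that_fit(size)} copies fit in {VRAM_GIB:.0f} GiB")
```

Under these assumptions the 13b model leaves room for a second copy, while two copies of the 70b model exceed the card's memory, which matches the behavior described here.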

Is there a reason or a fundamental principle why you cannot create embeddings if the model has been loaded without the embedding flag? It would be handy to have a hybrid mode in which you load the entire model once and can then perform both operations.

I'm curious what you think about this.
Best,
Christoph
