Single GPU vs multiple GPU (tensor parallel) suggestion for API Server #208
Comments
If you use 4x 4060 Ti, inference speed may be slow given the communication overhead between the GPUs (I haven't run a benchmark; this is based on theoretical knowledge). If you have a 24GB GPU, it is better to use that one. On the backend, openchat uses PyTorch, vLLM and Ray, so if you can configure those underlying libraries for AMD GPUs (ROCm), then you should in theory be able to use openchat models with an AMD GPU, much like they are now supported by ollama (thanks to all the hard work done by llama.cpp to support AMD GPUs, ollama being a wrapper around that library). You can still run openchat models with llama.cpp on an AMD GPU by following llama.cpp's AMD (ROCm) build guide.
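Since the openchat server path depends on PyTorch, vLLM and Ray all picking up ROCm, it can save time to first confirm that a ROCm build of PyTorch actually sees the AMD GPU. This is a minimal sketch, assuming a ROCm wheel of PyTorch is installed (ROCm devices are exposed through the `torch.cuda` API):

```python
# Minimal sketch (assumption): verify a ROCm build of PyTorch sees the AMD GPU
# before layering vLLM/Ray on top of it. ROCm devices show up via torch.cuda.
import torch

print(torch.__version__)                    # e.g. "2.1.0+rocm5.6" on a ROCm wheel
print(getattr(torch.version, "hip", None))  # HIP version string on ROCm, None otherwise
print(torch.cuda.is_available())            # True if the AMD GPU is visible

if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))    # e.g. an "AMD Radeon ..." device
```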
Output of the llama.cpp run (system_info):

system_info: n_threads = 6 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0

Sample output from the model:

Large Language Models
Large language models (LLMs) are artificial intelligence models that have been trained to understand and generate human-like text. They are called "large" because they typically consist of millions or even billions of parameters, which enable them to learn complex patterns and generate more accurate and coherent responses. Examples of popular LLMs include OpenAI's GPT-3, GPT-4, and Google's BERT.
Components of an LLM
Sample script:
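The script itself is not shown above. As a stand-in, here is a minimal sketch that sends a chat request to a locally running llama.cpp server through its OpenAI-compatible endpoint; the port, model name, and prompt are placeholders, and the `requests` package is assumed to be installed:

```python
# Minimal sketch (assumption): query a llama.cpp server started with something like
#   ./server -m openchat_3.5.Q4_K_M.gguf --port 8080
# through its OpenAI-compatible /v1/chat/completions endpoint.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "openchat_3.5",  # placeholder; the server answers with whatever model it loaded
        "messages": [
            {"role": "user", "content": "What are large language models?"}
        ],
        "temperature": 0.7,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```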
wow thanks for your detailed reply! Appreciated!
@vikrantrathore Thanks for your detailed answer! BTW, to use the provided openchat server with tensor parallelism over multiple GPUs, you can set the tensor parallel argument, e.g. as sketched below.
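The exact command from the original comment is not reproduced here. The openchat server builds on vLLM, where tensor parallelism is controlled by the `tensor_parallel_size` engine argument; a minimal sketch of the same setting used directly through vLLM's Python API (the model id and GPU count are placeholders):

```python
# Minimal sketch (assumption): the tensor-parallel setting expressed through
# vLLM's Python API; the openchat server passes an equivalent argument down
# to the vLLM engine it runs on.
from vllm import LLM, SamplingParams

llm = LLM(
    model="openchat/openchat_3.5",  # placeholder Hugging Face model id
    tensor_parallel_size=4,         # shard the model weights across 4 GPUs
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain tensor parallelism in one paragraph."], params)
print(outputs[0].outputs[0].text)
```

With `tensor_parallel_size=4` on 4x 4060 Ti, the weights are split across the cards, so each card only needs to hold its shard, but every forward pass incurs all-reduce traffic between the GPUs (over PCIe on consumer cards), which is the communication overhead mentioned above.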
Hi there and first of all thanks for this great tool!
I was wondering if you could provide any feedback about having a single RTX 4090 24GB vs 4x 4060 Ti 16GB.
In the end, the tensor core count of a 4x 4060 Ti stack would roughly match a 4090's, and the stack would have 64GB of VRAM in total instead of 24GB.
I can't tell whether the memory bandwidth of the 4x 4060 Ti stack would be a bottleneck compared to a single 4090.
One last thing: will AMD GPUs be supported one day?
Thanks in advance!