Update LoRA.py #4184
base: main
Conversation
Add GGML LoRA support for llama.cpp GGUF/GGML models (convert any LoRA with llama.cpp/convert-lora-to-ggml.py).
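For context, here is a minimal sketch of how a converted adapter ends up being applied through llama-cpp-python; the `lora_path` constructor argument is the same one the diff below passes to `llama_cpp.Llama`, while the file names are hypothetical placeholders:

```python
# Sketch, not part of the PR. Assumes llama-cpp-python is installed and that the
# adapter was converted with llama.cpp/convert-lora-to-ggml.py, which produces a
# ggml-adapter-model.bin file.
from llama_cpp import Llama

llm = Llama(
    model_path="models/base-model.gguf",               # hypothetical base model
    lora_path="loras/my-lora/ggml-adapter-model.bin",   # converted adapter
    n_ctx=2048,
    n_gpu_layers=0,  # at the time of this PR, applying a LoRA could require F16 weights on CPU
)

print(llm("Hello, ", max_tokens=16)["choices"][0]["text"])
```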
Does this work for K-quants + offloading yet? When I tried it, the model had to be F16 and run entirely on the CPU.
Hi, I haven't tested it with different params yet. I'm still evaluating it.
Hmmm, I did the same work before I saw this PR.
That makes three of us. I added it where the llama.cpp model is loaded originally and then just reload to add or remove the LoRA; that way you don't have to move all of those parameters to LoRA.py. Unfortunately I was met with the message about having to use an FP16 model or disable GPU offloading, which made it pretty much useless. I've had better luck merging the LoRA directly into K-quants after conversion: the models don't increase perplexity very much and can be fully offloaded.
Mine is the same approach, using llama.cpp's --lora parameter and reloading. I don't know how to merge it directly into K-quants.
https://github.com/xaedes/llama.cpp/tree/finetune-lora/examples/export-lora works after converting.
Is there a way of doing it without reloading the entire model?
In theory yes, but when I tried it, it didn't work for me. Maybe you can retry if the llama.cpp version has changed; the attempt should still be in the code, commented out.
AFAIK, the way it works in llama.cpp (and the Python bindings), the LoRA loads along with the model. Unfortunately, it still asks for F16 weights.
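The commented-out block in the diff below also references a `lora_base`; llama.cpp exposes this as `--lora-base`, and llama-cpp-python mirrors it as a `lora_base` constructor argument pointing at an F16 copy of the base model to apply the adapter against when the loaded model is quantized. A hedged sketch, assuming such an F16 copy exists (file names are placeholders):

```python
from llama_cpp import Llama

# Sketch only: lora_base supplies an unquantized (F16) version of the same base
# model for llama.cpp to use when applying the adapter on top of a quantized model.
llm = Llama(
    model_path="models/base-model.Q4_K_M.gguf",
    lora_path="loras/my-lora/ggml-adapter-model.bin",
    lora_base="models/base-model.f16.gguf",
)
```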
From the PR's diff of LoRA.py:

```python
if len(lora_names) == 0:
    shared.lora_names = []
    return
else:
    if len(lora_names) > 1:
        logger.warning('Llama can only work with 1 LoRA at the moment. Only the first one in the list will be loaded.')

    lora_path = get_lora_path(lora_names[0])
    lora_adapter_path = str(lora_path / "ggml-adapter-model.bin")

    logger.info("Applying the following LoRAs to {}: {}".format(shared.model_name, ', '.join([lora_names[0]])))

    if shared.args.tensor_split is None or shared.args.tensor_split.strip() == '':
        tensor_split_list = None
    else:
        tensor_split_list = [float(x) for x in shared.args.tensor_split.strip().split(",")]

    params = {
        'model_path': str(shared.model.model.model_path),
        'lora_path': str(lora_adapter_path),
        'n_ctx': shared.args.n_ctx,
        'seed': int(shared.args.llama_cpp_seed),
        'n_threads': shared.args.threads or None,
        'n_threads_batch': shared.args.threads_batch or None,
        'n_batch': shared.args.n_batch,
        'use_mmap': not shared.args.no_mmap,
        'use_mlock': shared.args.mlock,
        'mul_mat_q': shared.args.mul_mat_q,
        'numa': shared.args.numa,
        'n_gpu_layers': shared.args.n_gpu_layers,
        'rope_freq_base': RoPE.get_rope_freq_base(shared.args.alpha_value, shared.args.rope_freq_base),
        'tensor_split': tensor_split_list,
        'rope_freq_scale': 1.0 / shared.args.compress_pos_emb,
    }

    shared.model.model = llama_cpp.Llama(**params)

    # shared.model.model.lora_path = lora_adapter_path
    # if llama_cpp.llama_model_apply_lora_from_file(
    #     shared.model.model.model,
    #     shared.model.model.lora_path.encode("utf-8"),
    #     scale=1,
    #     path_base_model=shared.model.model.lora_base.encode("utf-8") if shared.model.model.lora_base is not None else llama_cpp.c_char_p(0),
    #     n_threads=shared.model.model.n_threads,
    # ):
    #     raise RuntimeError(
    #         f"Failed to apply LoRA from lora path: {shared.model.lora_path} to base path: {shared.model.lora_base}"
    #     )

    shared.lora_names = [lora_names[0]]

    logger.info(f"Succesfully Applied Lora {lora_adapter_path} to Model.")

    return
```
Hello @srossitto79,
I have a few suggestions here:
- you can just check `lora_names` with `not`
- since you already have a `return` there, you don't need the `else` block; you can just remove the `else` and decrease the indentation accordingly
- using an additional variable for the `lora_name` might be reasonable
- `', '.join([lora_names[0]])` makes no sense because the result will be equal to `lora_names[0]`, since there is only one element in the list (see the one-line check below)
- the `return` at the end is also redundant.
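As a one-line check of the `join` point above (plain Python, nothing project-specific):

```python
# Joining a single-element list just returns that element, so the separator never appears.
assert ', '.join(['my-lora']) == 'my-lora'
```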
Suggested change:

```python
if not lora_names:
    shared.lora_names = []
    return

if len(lora_names) > 1:
    logger.warning('Llama can only work with 1 LoRA at the moment. Only the first one in the list will be loaded.')

lora_name = lora_names[0]
lora_path = get_lora_path(lora_name)
lora_adapter_path = str(lora_path / "ggml-adapter-model.bin")
logger.info(f"Applying the following LoRAs to {shared.model_name}: {lora_name}")

if not (shared.args.tensor_split and shared.args.tensor_split.strip()):
    tensor_split_list = None
else:
    tensor_split_list = [float(x) for x in shared.args.tensor_split.strip().split(",")]

params = {
    'model_path': str(shared.model.model.model_path),
    'lora_path': str(lora_adapter_path),
    'n_ctx': shared.args.n_ctx,
    'seed': int(shared.args.llama_cpp_seed),
    'n_threads': shared.args.threads or None,
    'n_threads_batch': shared.args.threads_batch or None,
    'n_batch': shared.args.n_batch,
    'use_mmap': not shared.args.no_mmap,
    'use_mlock': shared.args.mlock,
    'mul_mat_q': shared.args.mul_mat_q,
    'numa': shared.args.numa,
    'n_gpu_layers': shared.args.n_gpu_layers,
    'rope_freq_base': RoPE.get_rope_freq_base(shared.args.alpha_value, shared.args.rope_freq_base),
    'tensor_split': tensor_split_list,
    'rope_freq_scale': 1.0 / shared.args.compress_pos_emb,
}

shared.model.model = llama_cpp.Llama(**params)

# shared.model.model.lora_path = lora_adapter_path
# if llama_cpp.llama_model_apply_lora_from_file(
#     shared.model.model.model,
#     shared.model.model.lora_path.encode("utf-8"),
#     scale=1,
#     path_base_model=shared.model.model.lora_base.encode("utf-8") if shared.model.model.lora_base is not None else llama_cpp.c_char_p(0),
#     n_threads=shared.model.model.n_threads,
# ):
#     raise RuntimeError(
#         f"Failed to apply LoRA from lora path: {shared.model.lora_path} to base path: {shared.model.lora_base}"
#     )

shared.lora_names = [lora_name]
logger.info(f"Succesfully Applied Lora {lora_adapter_path} to Model.")
```
Anything new here?