Update LoRA.py #4184

Open: wants to merge 1 commit into base: main

Conversation

@srossitto79 srossitto79 commented Oct 5, 2023

Add GGML LoRA support for llama.cpp GGUF/GGML models (convert any LoRA with llama.cpp/convert-lora-to-ggml.py).
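For reference, a minimal sketch of the conversion step. It assumes the script is invoked with the LoRA directory as its only required argument, and that the directory holds the Hugging Face PEFT files adapter_config.json and adapter_model.bin; the helper name build_convert_command and the file check are illustrative and not part of this PR.

```python
import sys
from pathlib import Path

# Files a Hugging Face PEFT LoRA directory is assumed to contain.
REQUIRED_FILES = ("adapter_config.json", "adapter_model.bin")


def build_convert_command(lora_dir, llama_cpp_dir):
    """Return the argv list for converting one LoRA directory (hypothetical helper).

    Raises FileNotFoundError if an expected PEFT file is missing.
    """
    lora_dir = Path(lora_dir)
    missing = [name for name in REQUIRED_FILES if not (lora_dir / name).is_file()]
    if missing:
        raise FileNotFoundError(f"{lora_dir} is missing: {', '.join(missing)}")
    script = Path(llama_cpp_dir) / "convert-lora-to-ggml.py"
    # Run the script with the current interpreter so it uses the same environment.
    return [sys.executable, str(script), str(lora_dir)]
```

The resulting list can be passed to subprocess.run(cmd, check=True). The code in this PR then looks for ggml-adapter-model.bin inside the LoRA folder.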

@Ph0rk0z
Contributor

Ph0rk0z commented Oct 6, 2023

Does this work with K-quants + offloading yet? When I tried it, the model had to be F16 and entirely on the CPU.

@srossitto79
Author

srossitto79 commented Oct 7, 2023 via email

@Touch-Night
Contributor

Hmmm, I did the same work before I saw this PR.

@Ph0rk0z
Contributor

Ph0rk0z commented Oct 13, 2023

That makes 3 of us. I added it where the llama.cpp model is originally loaded, and then just reload the model to add or remove the LoRA. That way you don't have to move all of those parameters to LoRA.py.

Unfortunately, I was met with a message saying I had to use an FP16 model or not use GPU offloading, which made it pretty useless. I have had better luck merging the LoRA directly into K-quants after conversion. The merged models don't increase perplexity very much and can be fully offloaded.

@Touch-Night
Contributor

> That makes 3 of us. I added it where the llama.cpp model is originally loaded, and then just reload the model to add or remove the LoRA. That way you don't have to move all of those parameters to LoRA.py.
>
> Unfortunately, I was met with a message saying I had to use an FP16 model or not use GPU offloading, which made it pretty useless. I have had better luck merging the LoRA directly into K-quants after conversion. The merged models don't increase perplexity very much and can be fully offloaded.

Mine takes the same approach: using llama.cpp's --lora parameter and reloading the model. I don't know how to merge it directly into K-quants.

@Ph0rk0z
Contributor

Ph0rk0z commented Oct 14, 2023

@oobabooga
Owner

Is there a way of doing it without reloading the entire model?

@srossitto79
Author

srossitto79 commented Oct 22, 2023 via email

@Ph0rk0z
Contributor

Ph0rk0z commented Oct 22, 2023

AFAIK, the way it works in llama.cpp (and the Python bindings), the LoRA is loaded along with the model. Unfortunately, it still asks for F16 weights.
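To illustrate "loads along with the model", here is a rough sketch of passing the adapter at construction time through the Python bindings. It assumes the lora_path and lora_base keyword arguments that llama_cpp.Llama accepted around the time of this thread; the helper names are hypothetical, and nothing in this thread confirms that supplying an F16 lora_base removes the F16 requirement when GPU offloading is used.

```python
def build_llama_kwargs(model_path, lora_path=None, lora_base=None, n_gpu_layers=0):
    """Collect constructor arguments for llama_cpp.Llama (hypothetical helper)."""
    kwargs = {"model_path": str(model_path), "n_gpu_layers": n_gpu_layers}
    if lora_path is not None:
        # The adapter is applied while the model loads, so switching LoRAs
        # means building a new Llama object, i.e. a full reload.
        kwargs["lora_path"] = str(lora_path)
        if lora_base is not None:
            # Assumption: optional F16 base model for use with a quantized model_path.
            kwargs["lora_base"] = str(lora_base)
    return kwargs


def load_with_lora(**options):
    """Build the model; llama_cpp is imported lazily so the helper above works without it."""
    import llama_cpp

    return llama_cpp.Llama(**build_llama_kwargs(**options))
```

Removing the LoRA in this scheme is the same call without lora_path, which is why every change costs a reload.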

Comment on lines +104 to +157
    if len(lora_names) == 0:
        shared.lora_names = []
        return
    else:
        if len(lora_names) > 1:
            logger.warning('Llama can only work with 1 LoRA at the moment. Only the first one in the list will be loaded.')

        lora_path = get_lora_path(lora_names[0])
        lora_adapter_path = str(lora_path / "ggml-adapter-model.bin")

        logger.info("Applying the following LoRAs to {}: {}".format(shared.model_name, ', '.join([lora_names[0]])))

        if shared.args.tensor_split is None or shared.args.tensor_split.strip() == '':
            tensor_split_list = None
        else:
            tensor_split_list = [float(x) for x in shared.args.tensor_split.strip().split(",")]

        params = {
            'model_path': str(shared.model.model.model_path),
            'lora_path': str(lora_adapter_path),
            'n_ctx': shared.args.n_ctx,
            'seed': int(shared.args.llama_cpp_seed),
            'n_threads': shared.args.threads or None,
            'n_threads_batch': shared.args.threads_batch or None,
            'n_batch': shared.args.n_batch,
            'use_mmap': not shared.args.no_mmap,
            'use_mlock': shared.args.mlock,
            'mul_mat_q': shared.args.mul_mat_q,
            'numa': shared.args.numa,
            'n_gpu_layers': shared.args.n_gpu_layers,
            'rope_freq_base': RoPE.get_rope_freq_base(shared.args.alpha_value, shared.args.rope_freq_base),
            'tensor_split': tensor_split_list,
            'rope_freq_scale': 1.0 / shared.args.compress_pos_emb,
        }

        shared.model.model = llama_cpp.Llama(**params)

        # shared.model.model.lora_path = lora_adapter_path
        # if llama_cpp.llama_model_apply_lora_from_file(
        #     shared.model.model.model,
        #     shared.model.model.lora_path.encode("utf-8"),
        #     scale=1,
        #     path_base_model=shared.model.model.lora_base.encode("utf-8") if shared.model.model.lora_base is not None else llama_cpp.c_char_p(0),
        #     n_threads=shared.model.model.n_threads,
        # ):
        #     raise RuntimeError(
        #         f"Failed to apply LoRA from lora path: {shared.model.lora_path} to base path: {shared.model.lora_base}"
        #     )

        shared.lora_names = [lora_names[0]]

        logger.info(f"Successfully Applied Lora {lora_adapter_path} to Model.")

        return

@Jamim Jamim Dec 5, 2023

Hello @srossitto79,
I have a few suggestions here:

  • You can check lora_names with not instead of comparing its length to 0.
  • Since the first branch already returns, the else block is unnecessary. You can remove the else and decrease the indentation accordingly.
  • Using an additional variable for the LoRA name might be reasonable.
  • ', '.join([lora_names[0]]) makes no sense: the list has only one element, so the result is always equal to lora_names[0].
  • The return at the end is also redundant.
Suggested change
    if not lora_names:
        shared.lora_names = []
        return

    if len(lora_names) > 1:
        logger.warning('Llama can only work with 1 LoRA at the moment. Only the first one in the list will be loaded.')

    lora_name = lora_names[0]
    lora_path = get_lora_path(lora_name)
    lora_adapter_path = str(lora_path / "ggml-adapter-model.bin")

    logger.info(f"Applying the following LoRAs to {shared.model_name}: {lora_name}")

    if not (shared.args.tensor_split and shared.args.tensor_split.strip()):
        tensor_split_list = None
    else:
        tensor_split_list = [float(x) for x in shared.args.tensor_split.strip().split(",")]

    params = {
        'model_path': str(shared.model.model.model_path),
        'lora_path': str(lora_adapter_path),
        'n_ctx': shared.args.n_ctx,
        'seed': int(shared.args.llama_cpp_seed),
        'n_threads': shared.args.threads or None,
        'n_threads_batch': shared.args.threads_batch or None,
        'n_batch': shared.args.n_batch,
        'use_mmap': not shared.args.no_mmap,
        'use_mlock': shared.args.mlock,
        'mul_mat_q': shared.args.mul_mat_q,
        'numa': shared.args.numa,
        'n_gpu_layers': shared.args.n_gpu_layers,
        'rope_freq_base': RoPE.get_rope_freq_base(shared.args.alpha_value, shared.args.rope_freq_base),
        'tensor_split': tensor_split_list,
        'rope_freq_scale': 1.0 / shared.args.compress_pos_emb,
    }

    shared.model.model = llama_cpp.Llama(**params)

    # shared.model.model.lora_path = lora_adapter_path
    # if llama_cpp.llama_model_apply_lora_from_file(
    #     shared.model.model.model,
    #     shared.model.model.lora_path.encode("utf-8"),
    #     scale=1,
    #     path_base_model=shared.model.model.lora_base.encode("utf-8") if shared.model.model.lora_base is not None else llama_cpp.c_char_p(0),
    #     n_threads=shared.model.model.n_threads,
    # ):
    #     raise RuntimeError(
    #         f"Failed to apply LoRA from lora path: {shared.model.lora_path} to base path: {shared.model.lora_base}"
    #     )

    shared.lora_names = [lora_name]
    logger.info(f"Successfully Applied Lora {lora_adapter_path} to Model.")
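As a standalone illustration of the review points above, the sketch below isolates the guard clause and the tensor_split parsing so they can be checked without loading a model. The function names parse_tensor_split and pick_single_lora are hypothetical and do not appear in the PR.

```python
def parse_tensor_split(value):
    """Turn a string such as "60,40" into [60.0, 40.0]; return None when unset or blank."""
    if not (value and value.strip()):
        return None
    return [float(x) for x in value.strip().split(",")]


def pick_single_lora(lora_names):
    """Return the first LoRA name, or None for an empty list.

    The early return replaces the if/else nesting from the original diff.
    """
    if not lora_names:
        return None
    return lora_names[0]


# Joining a one-element list gives back that element, so the join is redundant.
assert ", ".join(["my-lora"]) == "my-lora"
```

Keeping the parsing in a small function also lets the blank-string and None cases share a single truthiness check, as in the suggested change.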

@Gee1111

Gee1111 commented Feb 12, 2024

anything new here?
