Update LoRA.py #4184
base: main
Conversation
Add GGML LoRA support for llama.cpp GGUF/GGML models (convert any LoRA with llama.cpp/convert-lora-to-ggml.py).
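For context, here is a minimal sketch of how a converted adapter ends up being applied through llama-cpp-python; the `lora_path` constructor argument is the same one the diff below passes to `llama_cpp.Llama`, while the file names are hypothetical placeholders:

```python
# Sketch, not part of the PR. Assumes llama-cpp-python is installed and that the
# adapter was converted with llama.cpp/convert-lora-to-ggml.py, which produces a
# ggml-adapter-model.bin file.
from llama_cpp import Llama

llm = Llama(
    model_path="models/base-model.gguf",               # hypothetical base model
    lora_path="loras/my-lora/ggml-adapter-model.bin",   # converted adapter
    n_ctx=2048,
    n_gpu_layers=0,  # at the time of this PR, applying a LoRA could require F16 weights on CPU
)

print(llm("Hello, ", max_tokens=16)["choices"][0]["text"])
```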
Does this work for K-quants + offloading yet? When I tried it, the model had to be F16 and run entirely on the CPU.
Hi, I haven't tested it with different params yet. I'm still evaluating it.
Hmmm, I did the same work before I saw this PR.
That makes three of us. I added it where the llama.cpp model is loaded originally and then just reload to add or remove the LoRA; that way you don't have to move all of those parameters to LoRA.py. Unfortunately I was met with the message about having to use an FP16 model or disable GPU offloading, which made it pretty much useless. I've had better luck merging the LoRA directly into K-quants after conversion: the models don't increase perplexity very much and can be fully offloaded.
Mine is the same approach, using llama.cpp's --lora parameter and reloading. I don't know how to merge it directly into K-quants.
https://github.com/xaedes/llama.cpp/tree/finetune-lora/examples/export-lora works after converting.
Is there a way of doing it without reloading the entire model?
In theory yes, but when I tried it, it didn't work for me. Maybe you can retry if the llama.cpp version has changed; the attempt should still be in the code, commented out.
AFAIK, the way it works in llama.cpp (and the Python bindings), the LoRA loads along with the model. Unfortunately, it still asks for F16 weights.
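The commented-out block in the diff below also references a `lora_base`; llama.cpp exposes this as `--lora-base`, and llama-cpp-python mirrors it as a `lora_base` constructor argument pointing at an F16 copy of the base model to apply the adapter against when the loaded model is quantized. A hedged sketch, assuming such an F16 copy exists (file names are placeholders):

```python
from llama_cpp import Llama

# Sketch only: lora_base supplies an unquantized (F16) version of the same base
# model for llama.cpp to use when applying the adapter on top of a quantized model.
llm = Llama(
    model_path="models/base-model.Q4_K_M.gguf",
    lora_path="loras/my-lora/ggml-adapter-model.bin",
    lora_base="models/base-model.f16.gguf",
)
```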
From the PR's diff of LoRA.py:

```python
if len(lora_names) == 0:
    shared.lora_names = []
    return
else:
    if len(lora_names) > 1:
        logger.warning('Llama can only work with 1 LoRA at the moment. Only the first one in the list will be loaded.')

    lora_path = get_lora_path(lora_names[0])
    lora_adapter_path = str(lora_path / "ggml-adapter-model.bin")

    logger.info("Applying the following LoRAs to {}: {}".format(shared.model_name, ', '.join([lora_names[0]])))

    if shared.args.tensor_split is None or shared.args.tensor_split.strip() == '':
        tensor_split_list = None
    else:
        tensor_split_list = [float(x) for x in shared.args.tensor_split.strip().split(",")]

    params = {
        'model_path': str(shared.model.model.model_path),
        'lora_path': str(lora_adapter_path),
        'n_ctx': shared.args.n_ctx,
        'seed': int(shared.args.llama_cpp_seed),
        'n_threads': shared.args.threads or None,
        'n_threads_batch': shared.args.threads_batch or None,
        'n_batch': shared.args.n_batch,
        'use_mmap': not shared.args.no_mmap,
        'use_mlock': shared.args.mlock,
        'mul_mat_q': shared.args.mul_mat_q,
        'numa': shared.args.numa,
        'n_gpu_layers': shared.args.n_gpu_layers,
        'rope_freq_base': RoPE.get_rope_freq_base(shared.args.alpha_value, shared.args.rope_freq_base),
        'tensor_split': tensor_split_list,
        'rope_freq_scale': 1.0 / shared.args.compress_pos_emb,
    }

    shared.model.model = llama_cpp.Llama(**params)

    # shared.model.model.lora_path = lora_adapter_path
    # if llama_cpp.llama_model_apply_lora_from_file(
    #     shared.model.model.model,
    #     shared.model.model.lora_path.encode("utf-8"),
    #     scale=1,
    #     path_base_model=shared.model.model.lora_base.encode("utf-8") if shared.model.model.lora_base is not None else llama_cpp.c_char_p(0),
    #     n_threads=shared.model.model.n_threads,
    # ):
    #     raise RuntimeError(
    #         f"Failed to apply LoRA from lora path: {shared.model.lora_path} to base path: {shared.model.lora_base}"
    #     )

    shared.lora_names = [lora_names[0]]

    logger.info(f"Succesfully Applied Lora {lora_adapter_path} to Model.")

    return
```
Hello @srossitto79,
I have a few suggestions here:
- you can just check `lora_names` with `not`
- since you already have a `return` there, you don't need the `else` block; you can just remove the `else` and decrease the indentation accordingly
- using an additional variable for the `lora_name` might be reasonable
- `', '.join([lora_names[0]])` makes no sense because the result will be equal to `lora_names[0]`, since there is only one element in the list (see the one-line check below)
- the `return` at the end is also redundant.
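As a one-line check of the `join` point above (plain Python, nothing project-specific):

```python
# Joining a single-element list just returns that element, so the separator never appears.
assert ', '.join(['my-lora']) == 'my-lora'
```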
Suggested change:

```python
if not lora_names:
    shared.lora_names = []
    return

if len(lora_names) > 1:
    logger.warning('Llama can only work with 1 LoRA at the moment. Only the first one in the list will be loaded.')

lora_name = lora_names[0]
lora_path = get_lora_path(lora_name)
lora_adapter_path = str(lora_path / "ggml-adapter-model.bin")
logger.info(f"Applying the following LoRAs to {shared.model_name}: {lora_name}")

if not (shared.args.tensor_split and shared.args.tensor_split.strip()):
    tensor_split_list = None
else:
    tensor_split_list = [float(x) for x in shared.args.tensor_split.strip().split(",")]

params = {
    'model_path': str(shared.model.model.model_path),
    'lora_path': str(lora_adapter_path),
    'n_ctx': shared.args.n_ctx,
    'seed': int(shared.args.llama_cpp_seed),
    'n_threads': shared.args.threads or None,
    'n_threads_batch': shared.args.threads_batch or None,
    'n_batch': shared.args.n_batch,
    'use_mmap': not shared.args.no_mmap,
    'use_mlock': shared.args.mlock,
    'mul_mat_q': shared.args.mul_mat_q,
    'numa': shared.args.numa,
    'n_gpu_layers': shared.args.n_gpu_layers,
    'rope_freq_base': RoPE.get_rope_freq_base(shared.args.alpha_value, shared.args.rope_freq_base),
    'tensor_split': tensor_split_list,
    'rope_freq_scale': 1.0 / shared.args.compress_pos_emb,
}

shared.model.model = llama_cpp.Llama(**params)

# shared.model.model.lora_path = lora_adapter_path
# if llama_cpp.llama_model_apply_lora_from_file(
#     shared.model.model.model,
#     shared.model.model.lora_path.encode("utf-8"),
#     scale=1,
#     path_base_model=shared.model.model.lora_base.encode("utf-8") if shared.model.model.lora_base is not None else llama_cpp.c_char_p(0),
#     n_threads=shared.model.model.n_threads,
# ):
#     raise RuntimeError(
#         f"Failed to apply LoRA from lora path: {shared.model.lora_path} to base path: {shared.model.lora_base}"
#     )

shared.lora_names = [lora_name]
logger.info(f"Succesfully Applied Lora {lora_adapter_path} to Model.")
```
Anything new here?