Help with loading model #1499

Open
Axel-At-Apollo opened this issue Jan 3, 2025 · 3 comments
Comments

@Axel-At-Apollo commented Jan 3, 2025

See #1410; it has not been resolved yet.

@KareemMusleh (Contributor)

I got your code to work like this:

import importlib

import transformers

# Reload the modeling module to drop Unsloth's monkey patches
modeling_llama = importlib.reload(transformers.models.llama.modeling_llama)

# checkpoint_path is the directory saved earlier with save_pretrained
model = modeling_llama.LlamaForCausalLM.from_pretrained(checkpoint_path)
tokenizer = transformers.AutoTokenizer.from_pretrained(checkpoint_path)

# Generate text with the reloaded (unpatched) model
prompt = "Write a short story about a robot learning to paint:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    inputs.input_ids,
    max_length=200,
    num_return_sequences=1,
    temperature=0.7,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("\nGenerated text:")
print(generated_text)

I will try to find a more general solution.

@Axel-At-Apollo (Author)

Thanks for this @KareemMusleh! This indeed works for the generation case. However, when trying to use lm_eval (see the first comment in #1410), I still get an error. Do you perhaps know what other parts of transformers I need to importlib.reload() to get that working?

Script:

import importlib

from lm_eval import simple_evaluate
from unsloth import FastLanguageModel

# Load with unsloth
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="./llama-3-2-1b-instruct",
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=False,
)

# Save the model and tokenizer
checkpoint_path = "other_base_llama_3b_instruct"
model.save_pretrained(checkpoint_path)
tokenizer.save_pretrained(checkpoint_path)

# Reload the transformers library to remove unsloth patches
import transformers

importlib.reload(transformers)
importlib.reload(transformers.models.llama.modeling_llama)


# Evaluate the model
results = simple_evaluate(
    model="hf",
    model_args=f"pretrained={checkpoint_path}",
    tasks=["tinyMMLU"],
)

Terminal Output:

root@20ba2c8eeea4:/app# python unslo.py 
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.1.1: Fast Llama patching. Transformers: 4.46.2.
   \\   /|    GPU: NVIDIA A10G. Max memory: 21.975 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.0+cu124. CUDA: 8.6. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
2025-01-08:12:11:12,015 INFO     [evaluator.py:164] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-01-08:12:11:12,015 INFO     [evaluator.py:201] Initializing hf model, with arguments: {'pretrained': 'other_base_llama_3b_instruct'}
2025-01-08:12:11:12,056 INFO     [huggingface.py:132] Using device 'cuda'
2025-01-08:12:11:12,571 INFO     [huggingface.py:369] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda'}
2025-01-08:12:11:22,748 INFO     [task.py:415] Building contexts for tinyMMLU on rank 0...
100%|██████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 3227.23it/s]
2025-01-08:12:11:22,785 INFO     [evaluator.py:496] Running loglikelihood requests
Running loglikelihood requests:   0%|                                                      | 0/400 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/app/unslo.py", line 27, in <module>
    results = simple_evaluate(
              ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/lm_eval/utils.py", line 401, in _wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/lm_eval/evaluator.py", line 303, in simple_evaluate
    results = evaluate(
              ^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/lm_eval/utils.py", line 401, in _wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/lm_eval/evaluator.py", line 507, in evaluate
    resps = getattr(lm, reqtype)(cloned_reqs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/lm_eval/api/model.py", line 378, in loglikelihood
    return self._loglikelihood_tokens(new_reqs, disable_tqdm=disable_tqdm)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/lm_eval/models/huggingface.py", line 1166, in _loglikelihood_tokens
    self._model_call(batched_inps, **call_kwargs), dim=-1
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/lm_eval/models/huggingface.py", line 856, in _model_call
    return self.model(inps).logits
           ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/unsloth/models/llama.py", line 986, in _CausalLM_fast_forward
    outputs = self.model(
              ^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: LlamaModel.forward() got an unexpected keyword argument 'causal_mask'
Running loglikelihood requests:   0%|                                                      | 0/400 [00:00<?, ?it/s]

If I don't add the reload lines (i.e. importlib.reload(transformers) and importlib.reload(transformers.models.llama.modeling_llama)), but otherwise keep the script the same, I get a different error:

root@20ba2c8eeea4:/app# python unslo.py 
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.1.1: Fast Llama patching. Transformers: 4.46.2.
   \\   /|    GPU: NVIDIA A10G. Max memory: 21.975 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.0+cu124. CUDA: 8.6. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
2025-01-08:12:13:23,134 INFO     [evaluator.py:164] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-01-08:12:13:23,134 INFO     [evaluator.py:201] Initializing hf model, with arguments: {'pretrained': 'other_base_llama_3b_instruct'}
2025-01-08:12:13:23,172 INFO     [huggingface.py:132] Using device 'cuda'
2025-01-08:12:13:23,698 INFO     [huggingface.py:369] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda'}
2025-01-08:12:13:33,457 INFO     [task.py:415] Building contexts for tinyMMLU on rank 0...
100%|██████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 3211.98it/s]
2025-01-08:12:13:33,494 INFO     [evaluator.py:496] Running loglikelihood requests
Running loglikelihood requests:   0%|                                                      | 0/400 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/app/unslo.py", line 24, in <module>
    results = simple_evaluate(
              ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/lm_eval/utils.py", line 401, in _wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/lm_eval/evaluator.py", line 303, in simple_evaluate
    results = evaluate(
              ^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/lm_eval/utils.py", line 401, in _wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/lm_eval/evaluator.py", line 507, in evaluate
    resps = getattr(lm, reqtype)(cloned_reqs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/lm_eval/api/model.py", line 378, in loglikelihood
    return self._loglikelihood_tokens(new_reqs, disable_tqdm=disable_tqdm)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/lm_eval/models/huggingface.py", line 1166, in _loglikelihood_tokens
    self._model_call(batched_inps, **call_kwargs), dim=-1
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/lm_eval/models/huggingface.py", line 856, in _model_call
    return self.model(inps).logits
           ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/unsloth/models/llama.py", line 986, in _CausalLM_fast_forward
    outputs = self.model(
              ^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/unsloth/models/llama.py", line 849, in LlamaModel_fast_forward
    layer_outputs = decoder_layer(
                    ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/unsloth/models/llama.py", line 508, in LlamaDecoderLayer_fast_forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
                                                          ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/unsloth/models/llama.py", line 371, in LlamaAttention_fast_forward
    Q, K, V = self.apply_qkv(self, hidden_states)
              ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1931, in __getattr__
    raise AttributeError(
AttributeError: 'LlamaSdpaAttention' object has no attribute 'apply_qkv'
Running loglikelihood requests:   0%|                                                      | 0/400 [00:01<?, ?it/s]

So it definitely has an effect. I guess this would work if I just knew all of the parts of transformers that I had to reload.
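
For reference, a brute-force sketch of that idea is to walk sys.modules and reload every transformers submodule. This is only a sketch, untested against this setup: reload order matters, and stale references held by other modules (e.g. unsloth itself) can survive the reload, so it may still fail.

import importlib
import sys

# Reload every transformers submodule already present in sys.modules.
# sorted() reloads parent packages before their children.
for name in sorted(
    k for k in sys.modules
    if k == "transformers" or k.startswith("transformers.")
):
    module = sys.modules.get(name)
    if module is not None:
        try:
            importlib.reload(module)
        except (ImportError, TypeError):
            # Some entries (lazy modules, namespace packages) can't be reloaded
            pass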

@danielhanchen (Contributor)

Apologies for the delay - if reloading does not work, I suggest using subprocess to launch another Python process from within Python - that should make it work.

The issue is that Unsloth directly patches torch, transformers, and many other libraries, so errors might be pervasive - sorry about the issue.
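
For example, here is a minimal sketch of that workaround, reusing the checkpoint path and task from the script above. The child interpreter imports a clean transformers that Unsloth never patched:

import subprocess
import sys

checkpoint_path = "other_base_llama_3b_instruct"

# Run lm_eval in a fresh interpreter so Unsloth's patches (applied in
# this process) never reach the transformers the child process imports.
eval_code = f"""
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained={checkpoint_path}",
    tasks=["tinyMMLU"],
)
print(results["results"])
"""
subprocess.run([sys.executable, "-c", eval_code], check=True)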
