
Fix PretrainedTokenizerFast check #35835

Open
wants to merge 5 commits into main
Conversation


@CL-ModelCloud CL-ModelCloud commented Jan 22, 2025

@CL-ModelCloud CL-ModelCloud marked this pull request as ready for review January 22, 2025 10:31
@Qubitium
Contributor

Qubitium commented Jan 22, 2025

@ArthurZucker @itazap Pretty sure the PR fixes a class-name string comparison bug by replacing it with a correct isinstance check. It fixes the issue reported here: #35832

The existing check for whether the class is an instance of PreTrainedTokenizerFast will never be met, since it performs a simple string comparison of the class name instead of a base-class inheritance check.

The only problem is that we had to inject a function-level import due to a circular import.
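
For reference, here is a minimal standalone illustration (not the PR diff itself) of the difference between the two checks, using only the public transformers API; the DeepSeek checkpoint is just a convenient example of a fast-only tokenizer:

from transformers import AutoTokenizer, PreTrainedTokenizerFast

tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")

# a bare class-name string compare only matches the base class itself,
# never subclasses such as LlamaTokenizerFast, so a branch guarded this
# way can never fire for concrete tokenizers
print(tok.__class__.__name__ == "PreTrainedTokenizerFast")  # False

# an inheritance check is what the save path actually needs
print(isinstance(tok, PreTrainedTokenizerFast))  # True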

@CL-ModelCloud CL-ModelCloud changed the title Fix the bug in tokenizer.save_pretrained when saving tokenizer_class … Fix PretrainedTokenizerFast check Jan 22, 2025
Collaborator

@ArthurZucker ArthurZucker left a comment


Thanks, I am not sure it's worth "fixing" as it does not cause any issues apart from "consistency": deserializing works perfectly even if you don't have tokenizers, while otherwise I am not sure this is guaranteed.

@@ -2456,8 +2456,10 @@ def save_pretrained(

# Add tokenizer class to the tokenizer config to be able to reload it with from_pretrained
tokenizer_class = self.__class__.__name__
# import here to prevent circular import error
from .tokenization_utils_fast import PreTrainedTokenizerFast
Collaborator


would love to avoid this, it's pretty weird

Contributor


> would love to avoid this, it's pretty weird

Another option is to remove this block of Fast-strip code completely. Is this legacy code still needed now?

Contributor

@Qubitium Qubitium Jan 23, 2025


We are now using a different, cleaner method to perform the same fix.

@Qubitium
Contributor

Qubitium commented Jan 23, 2025

@ArthurZucker Please re-review. We are now using a different, cleaner method to fix this bug, and we do consider this a bug, and a rather severe one: loading a re-saved tokenizer is fundamentally changed compared to loading the original.

        # Remove the Fast at the end if we can save the slow tokenizer
        if tokenizer_class.endswith("Fast") and (
            hasattr(self, "can_save_slow_tokenizer") and self.can_save_slow_tokenizer
        ):

Some models, like the recently released deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B, do not provide vocab.json and natively use a Fast tokenizer class: LlamaTokenizerFast. If this bug is not fixed, loading the tokenizer will break when use_fast=False is applied to the fast-stripped, down-classed tokenizer. The current code assumes all Fast models include vocab.json, for reasons I do not yet understand.
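
As a quick sanity check of that claim, a small sketch (assuming the same DeepSeek checkpoint); can_save_slow_tokenizer is the attribute the updated condition above keys off:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
# LlamaTokenizerFast reports whether the slow (sentencepiece) vocab file is
# available; since this checkpoint ships none, this should print False,
# so the condition above would keep the "Fast" suffix when saving
print(getattr(tok, "can_save_slow_tokenizer", None))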

Full context: we found this bug after using GPTQModel to quantize various DeepSeek models and using EvalPlus to run benchmarks on them. EvalPlus always loads the slow tokenizer and passes use_fast=False. Native models worked fine; quantized models failed. We traced the failure to this tokenizer save() bug, which had nothing to do with quantization, nor did we modify the tokenizer, so this state change was unexpected.

Please execute the following simple reproducer to re-create the use_fast=False tokenizer-loading crash on main.

import tempfile
from transformers import AutoTokenizer

# this model has no vocab file and uses LlamaTokenizerFast
model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"

with tempfile.TemporaryDirectory() as temp_dir:
    # 'tokenizer_class' in tokenizer_config.json is LlamaTokenizerFast now
    tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True) # <-- works
    # print(type(tokenizer)) => LlamaTokenizerFast
    tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False) # <-- works
    # print(type(tokenizer)) => LlamaTokenizerFast
    tokenizer.save_pretrained(temp_dir)
    # 'tokenizer_class' in tokenizer_config.json is LlamaTokenizer now, and no vocab file saved

    tokenizer = AutoTokenizer.from_pretrained(temp_dir, use_fast=False) # <-- crashes
    # can't load the tokenizer because the saved tokenizer path has no vocab file
Traceback (most recent call last):
  File "/GPTQModel/sample_test.py", line 18, in <module>
    tokenizer = AutoTokenizer.from_pretrained(temp_dir, use_fast=False)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/mlx/lib/python3.11/site-packages/transformers/models/auto/tokenization_auto.py", line 921, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/mlx/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 2032, in from_pretrained
    return cls._from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/mlx/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 2272, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/mlx/lib/python3.11/site-packages/transformers/models/llama/tokenization_llama.py", line 169, in __init__
    self.sp_model = self.get_spm_processor(kwargs.pop("from_slow", False))
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/mlx/lib/python3.11/site-packages/transformers/models/llama/tokenization_llama.py", line 196, in get_spm_processor
    tokenizer.Load(self.vocab_file)
  File "/opt/anaconda3/envs/mlx/lib/python3.11/site-packages/sentencepiece/__init__.py", line 961, in Load
    return self.LoadFromFile(model_file)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lrl/opt/anaconda3/envs/mlx/lib/python3.11/site-packages/sentencepiece/__init__.py", line 316, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: not a string

Lastly, as we are not experts in tokenizer loading, this PR may also be fixing the right bug in the wrong place. This possibility has not escaped me, and I have a nagging feeling there is more to this bug than just this PR.

@Qubitium
Contributor

@ArthurZucker After re-reviewing the latest code changes, can we just remove this check-and-strip altogether? Downgrading the tokenizer class type does break things, as we have detailed above. Does it make more sense to remove this block of code completely rather than patch-fixing it? What do you think?

@Qubitium
Contributor

Qubitium commented Jan 23, 2025

@ArthurZucker Sorry to flood, but we checked both tokenizer load and save again, and here is the crux of the problem:

  1. During tokenizer load, the Fast tokenizer is prioritized.
  2. If tokenizer_config.json has a tokenizer_class and it is a Fast tokenizer, the HF code will never use the non-fast one, even if the user passed use_fast=False. There appears to be a disconnect here when the user explicitly asks for use_fast=False. Should a warning at least be issued here, something like "This tokenizer has no slow tokenizer, disregarding use_fast=False"?
  3. A Fast tokenizer can be loaded with use_fast=[True, False].
  4. Tokenizer save on main breaks this by downcasting the Fast tokenizer to a Slow tokenizer, for which use_fast=False cannot work if the tokenizer is pure-fast, with no vocab.json for slow-tokenizer compatibility (see the sketch after the referenced code below).
  5. use_fast=True still works on the saved, fast-stripped tokenizer because the code below auto-attaches Fast when loading, to auto-detect whether a fast tokenizer can be used.
  6. With the fast-stripped tokenizer class, use_fast=False is never checked in the same lines 925-936, so it loads using the non-fast tokenizer class code and fails on the missing vocab.json.

Ref:

https://github.com/huggingface/transformers/blob/main/src/transformers/models/auto/tokenization_auto.py#L925-L936

    elif config_tokenizer_class is not None:
        tokenizer_class = None
        if use_fast and not config_tokenizer_class.endswith("Fast"):
            tokenizer_class_candidate = f"{config_tokenizer_class}Fast"
            tokenizer_class = tokenizer_class_from_name(tokenizer_class_candidate)
        if tokenizer_class is None:
            tokenizer_class_candidate = config_tokenizer_class
            tokenizer_class = tokenizer_class_from_name(tokenizer_class_candidate)
        if tokenizer_class is None:
            raise ValueError(
                f"Tokenizer class {tokenizer_class_candidate} does not exist or is not currently imported."
            )
        return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
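
To make point 4 concrete, here is a small sketch (same checkpoint as the reproducer, behaviour as observed on main at the time of this PR) that inspects what save_pretrained actually writes:

import json
import os
import tempfile
from transformers import AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
with tempfile.TemporaryDirectory() as temp_dir:
    AutoTokenizer.from_pretrained(model_id).save_pretrained(temp_dir)
    with open(os.path.join(temp_dir, "tokenizer_config.json")) as f:
        # on main this prints the downcast name "LlamaTokenizer"
        print(json.load(f)["tokenizer_class"])
    # no slow vocab / sentencepiece file is written alongside it
    print(sorted(os.listdir(temp_dir)))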

Successfully merging this pull request may close these issues.

tokenizer_class: LlamaTokenizerFast becomes LlamaTokenizer after load + immediate save