
Fix PretrainedTokenizerFast check #35835

Open
wants to merge 5 commits into main
Conversation


@CL-ModelCloud CL-ModelCloud commented Jan 22, 2025

@CL-ModelCloud CL-ModelCloud marked this pull request as ready for review January 22, 2025 10:31
@Qubitium
Contributor

Qubitium commented Jan 22, 2025

@ArthurZucker @itazap Pretty sure the PR fixes a class-name string comparison bug by replacing it with a correct isinstance check. It fixes the issue reported here: #35832

The existing check for whether the class is an instance of PreTrainedTokenizerFast will never be met, since it performs a simple string comparison of the class name instead of a base-class inheritance check.

The only problem is that we had to inject a function-level import due to a circular import.
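
For reference, here is a minimal standalone illustration (not the PR diff itself) of the difference between the two checks, using only the public transformers API; the DeepSeek checkpoint is just a convenient example of a fast-only tokenizer:

from transformers import AutoTokenizer, PreTrainedTokenizerFast

tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")

# a bare class-name string compare only matches the base class itself,
# never subclasses such as LlamaTokenizerFast, so a branch guarded this
# way can never fire for concrete tokenizers
print(tok.__class__.__name__ == "PreTrainedTokenizerFast")  # False

# an inheritance check is what the save path actually needs
print(isinstance(tok, PreTrainedTokenizerFast))  # True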

@CL-ModelCloud CL-ModelCloud changed the title Fix the bug in tokenizer.save_pretrained when saving tokenizer_class … Fix PretrainedTokenizerFast check Jan 22, 2025
Collaborator

@ArthurZucker ArthurZucker left a comment


Thanks, I am not sure it's worth "fixing" as it does not cause any issues apart from "consistency": deserializing works perfectly even if you don't have tokenizers, while otherwise I am not sure this is guaranteed.

@@ -2456,8 +2456,10 @@ def save_pretrained(

# Add tokenizer class to the tokenizer config to be able to reload it with from_pretrained
tokenizer_class = self.__class__.__name__
# import here to prevent circular import error
from .tokenization_utils_fast import PreTrainedTokenizerFast
Collaborator


would love to avoid this, it's pretty weird

Contributor


> would love to avoid this, it's pretty weird

Another option is to remove this block of Fast-strip code completely. Is this legacy code still needed now?

Contributor

@Qubitium Qubitium Jan 23, 2025


We are now using a different, cleaner method to perform the same fix.

@Qubitium
Contributor

Qubitium commented Jan 23, 2025

@ArthurZucker Please re-review. We are now using a different, cleaner method to fix this bug, and we do consider this a bug, and a rather severe one: loading a re-saved tokenizer is fundamentally changed compared to loading the original.

        # Remove the Fast at the end if we can save the slow tokenizer
        if tokenizer_class.endswith("Fast") and (
            hasattr(self, "can_save_slow_tokenizer") and self.can_save_slow_tokenizer
        ):

Some models, like the recently released deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B, do not provide vocab.json and natively use a Fast tokenizer class: LlamaTokenizerFast. If this bug is not fixed, loading the tokenizer will break when use_fast=False is applied to the fast-stripped, down-classed tokenizer. The current code assumes all Fast models include vocab.json, for reasons I do not yet understand.
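
As a quick sanity check of that claim, a small sketch (assuming the same DeepSeek checkpoint); can_save_slow_tokenizer is the attribute the updated condition above keys off:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
# LlamaTokenizerFast reports whether the slow (sentencepiece) vocab file is
# available; since this checkpoint ships none, this should print False,
# so the condition above would keep the "Fast" suffix when saving
print(getattr(tok, "can_save_slow_tokenizer", None))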

Full context: we found this bug after using GPTQModel to quantize various DeepSeek models and using EvalPlus to run benchmarks on them. EvalPlus always loads the slow tokenizer and passes use_fast=False. Native models worked fine; quantized models failed. We traced the failure to this tokenizer save() bug, which had nothing to do with quantization, nor did we modify the tokenizer, so this state change was unexpected.

Please execute the following simple reproducer to re-create the use_fast=False tokenizer-loading crash on main.

import tempfile
from transformers import AutoTokenizer

# this model has no vocab file and uses LlamaTokenizerFast
model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"

with tempfile.TemporaryDirectory() as temp_dir:
    # 'tokenizer_class' in tokenizer_config.json is LlamaTokenizerFast now
    tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True) # <-- works
    # print(type(tokenizer)) => LlamaTokenizerFast
    tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False) # <-- works
    # print(type(tokenizer)) => LlamaTokenizerFast
    tokenizer.save_pretrained(temp_dir)
    # 'tokenizer_class' in tokenizer_config.json is LlamaTokenizer now, and no vocab file saved

    tokenizer = AutoTokenizer.from_pretrained(temp_dir, use_fast=False) # <-- crashes
    # can't load the tokenizer because the saved tokenizer path has no vocab file
Traceback (most recent call last):
  File "/GPTQModel/sample_test.py", line 18, in <module>
    tokenizer = AutoTokenizer.from_pretrained(temp_dir, use_fast=False)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/mlx/lib/python3.11/site-packages/transformers/models/auto/tokenization_auto.py", line 921, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/mlx/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 2032, in from_pretrained
    return cls._from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/mlx/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 2272, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/mlx/lib/python3.11/site-packages/transformers/models/llama/tokenization_llama.py", line 169, in __init__
    self.sp_model = self.get_spm_processor(kwargs.pop("from_slow", False))
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/mlx/lib/python3.11/site-packages/transformers/models/llama/tokenization_llama.py", line 196, in get_spm_processor
    tokenizer.Load(self.vocab_file)
  File "/opt/anaconda3/envs/mlx/lib/python3.11/site-packages/sentencepiece/__init__.py", line 961, in Load
    return self.LoadFromFile(model_file)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lrl/opt/anaconda3/envs/mlx/lib/python3.11/site-packages/sentencepiece/__init__.py", line 316, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: not a string

Lastly, as we are not experts in tokenizer loading, this PR may also be fixing the right bug in the wrong place. This possibility has not escaped me, and I have a nagging feeling there is more to this bug than just this PR.

@Qubitium
Contributor

@ArthurZucker After re-reviewing the latest code changes, can we just remove this check-and-strip altogether? Downgrading the tokenizer class type does break things, as we have detailed above. Does it make more sense to remove this block of code completely rather than patch-fixing it? What do you think?

@Qubitium
Contributor

Qubitium commented Jan 23, 2025

@ArthurZucker Sorry to flood, but we checked both tokenizer load and save again, and here is the crux of the problem:

  1. During tokenizer load, the Fast tokenizer is prioritized.
  2. If tokenizer_config.json has a tokenizer_class and it is a Fast tokenizer, the HF code will never use the non-fast one, even if the user passed use_fast=False. There appears to be a disconnect here when the user explicitly asks for use_fast=False. Should a warning at least be issued here, something like "This tokenizer has no slow tokenizer, disregarding use_fast=False"?
  3. A Fast tokenizer can be loaded with use_fast=[True, False].
  4. Tokenizer save on main breaks this by downcasting the Fast tokenizer to a Slow tokenizer, for which use_fast=False cannot work if the tokenizer is pure-fast, with no vocab.json for slow-tokenizer compatibility (see the sketch after the referenced code below).
  5. use_fast=True still works on the saved, fast-stripped tokenizer because the code below auto-attaches Fast when loading, to auto-detect whether a fast tokenizer can be used.
  6. With the fast-stripped tokenizer class, use_fast=False is never checked in the same lines 925-936, so it loads using the non-fast tokenizer class code and fails on the missing vocab.json.

Ref:

https://github.com/huggingface/transformers/blob/main/src/transformers/models/auto/tokenization_auto.py#L925-L936

    elif config_tokenizer_class is not None:
        tokenizer_class = None
        if use_fast and not config_tokenizer_class.endswith("Fast"):
            tokenizer_class_candidate = f"{config_tokenizer_class}Fast"
            tokenizer_class = tokenizer_class_from_name(tokenizer_class_candidate)
        if tokenizer_class is None:
            tokenizer_class_candidate = config_tokenizer_class
            tokenizer_class = tokenizer_class_from_name(tokenizer_class_candidate)
        if tokenizer_class is None:
            raise ValueError(
                f"Tokenizer class {tokenizer_class_candidate} does not exist or is not currently imported."
            )
        return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
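
To make point 4 concrete, here is a small sketch (same checkpoint as the reproducer, behaviour as observed on main at the time of this PR) that inspects what save_pretrained actually writes:

import json
import os
import tempfile
from transformers import AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
with tempfile.TemporaryDirectory() as temp_dir:
    AutoTokenizer.from_pretrained(model_id).save_pretrained(temp_dir)
    with open(os.path.join(temp_dir, "tokenizer_config.json")) as f:
        # on main this prints the downcast name "LlamaTokenizer"
        print(json.load(f)["tokenizer_class"])
    # no slow vocab / sentencepiece file is written alongside it
    print(sorted(os.listdir(temp_dir)))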

Successfully merging this pull request may close these issues.

tokenizer_class: LlamaTokenizerFast becomes LlamaTokenizer after load + immediate save