tokenizer_class: LlamaTokenizerFast becomes LlamaTokenizer after load + immediate save #35832

Open
2 of 4 tasks
Qubitium opened this issue Jan 22, 2025 · 4 comments · May be fixed by #35835
@Qubitium
Contributor

System Info

I do not understand why, but saving a loaded tokenizer changes the tokenizer class type. I am unsure whether this is a usage error on my part or expected behavior from HF.

Who can help?

@ArthurZucker @itazap

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import AutoTokenizer

# Load the fast tokenizer and immediately re-save it without any modification.
tokenizer = AutoTokenizer.from_pretrained("DeepSeek/DeepSeek-R1-Distill-Qwen-7B")
tokenizer.save_pretrained("./tokenizer_copy")

tokenizer_config.json

  • before save "tokenizer_class": "LlamaTokenizerFast"
  • after save "tokenizer_class": "LlamaTokenizer"
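A minimal way to confirm the change is to read the saved JSON directly (a sketch; it assumes the reproduction above was run and ./tokenizer_copy exists):

import json

# Inspect the tokenizer_class recorded in the copy written by save_pretrained().
with open("./tokenizer_copy/tokenizer_config.json") as f:
    saved_config = json.load(f)

# Prints "LlamaTokenizer" instead of the original "LlamaTokenizerFast".
print(saved_config["tokenizer_class"])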

Expected behavior

The tokenizer class should stay the same after save_pretrained.

@Qubitium Qubitium added the bug label Jan 22, 2025
@Qubitium
Contributor Author

Qubitium commented Jan 22, 2025

I don't understand why Fast tokenizers get special handling here, with "Fast" stripped from the class name:

# Remove the Fast at the end unless we have a special `PreTrainedTokenizerFast`
if tokenizer_class.endswith("Fast") and tokenizer_class != "PreTrainedTokenizerFast":
    tokenizer_class = tokenizer_class[:-4]

I think the code should be using an isinstance check rather than a string class-name comparison?
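For illustration only, a rough sketch of what an isinstance-based decision could look like (my own sketch, not the patch in #35835; the slow_tokenizer_class / can_save_slow_tokenizer guard is an assumption about how the check could work):

from transformers import PreTrainedTokenizerFast

def config_tokenizer_class(tokenizer) -> str:
    """Sketch: choose the class name to record in tokenizer_config.json via
    isinstance checks instead of stripping the "Fast" suffix from the name."""
    name = type(tokenizer).__name__
    if not isinstance(tokenizer, PreTrainedTokenizerFast):
        return name  # already a slow tokenizer, record it as-is
    # Only down-class to the slow name when a slow counterpart exists and the
    # slow vocab files can actually be saved (assumed guard).
    slow_class = getattr(tokenizer, "slow_tokenizer_class", None)
    if slow_class is not None and getattr(tokenizer, "can_save_slow_tokenizer", False):
        return slow_class.__name__
    return name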

CL-ModelCloud linked a pull request Jan 22, 2025 that will close this issue
@ArthurZucker
Collaborator

Having a look, thanks!

@ArthurZucker
Collaborator

TBH this has not been touched in 4 years, it seems. The main reason is that if you keep the non-fast class name you can still reload the tokenizer, so it does not really make a big difference.

@Qubitium
Contributor Author

Qubitium commented Jan 23, 2025

Duplicating PR notes here:

Some models, like the recently released deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B, do not provide vocab.json but have a native Fast tokenizer class: LlamaTokenizerFast. If this bug is not fixed, loading the tokenizer will break when use_fast=False is applied to the Fast-stripped, down-classed tokenizer. The current code assumes all Fast models include vocab.json, for reasons I do not yet understand.
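For example (illustrative, reusing the copy written in the reproduction above), the slow-path reload then fails because the checkpoint never shipped the slow vocab files:

from transformers import AutoTokenizer

# The re-saved copy now declares "tokenizer_class": "LlamaTokenizer" (slow),
# but no vocab.json / sentencepiece model was ever shipped with the original
# fast-only checkpoint, so forcing the slow path errors out.
tokenizer = AutoTokenizer.from_pretrained("./tokenizer_copy", use_fast=False)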

Full context: we found this bug after using GPTQModel to quantize various DeepSeek models and using EvalPlus to run benchmarks on them. EvalPlus always loads the slow tokenizer by passing use_fast=False. Native models worked fine; quantized models failed. We traced the failure to this tokenizer save() bug. It has nothing to do with quantization, and we did not modify the tokenizer, so this state change is unexpected.

EDIT: Maybe we should just remove this Fast down-cast altogether?
