tokenizer_class: LlamaTokenizerFast becomes LlamaTokenizer after load + immediate save #35832

Open
2 of 4 tasks
Qubitium opened this issue Jan 22, 2025 · 4 comments · May be fixed by #35835
@Qubitium
Contributor

System Info

I do not understand why, but saving a loaded tokenizer changes the tokenizer class type. I am unsure whether this is a usage error on my part or expected behavior from HF.

Who can help?

@ArthurZucker @itazap

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import AutoTokenizer

# Load the fast tokenizer and immediately re-save it without any modification.
tokenizer = AutoTokenizer.from_pretrained("DeepSeek/DeepSeek-R1-Distill-Qwen-7B")
tokenizer.save_pretrained("./tokenizer_copy")

tokenizer_config.json

  • before save "tokenizer_class": "LlamaTokenizerFast"
  • after save "tokenizer_class": "LlamaTokenizer"
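A minimal way to confirm the change is to read the saved JSON directly (a sketch; it assumes the reproduction above was run and ./tokenizer_copy exists):

import json

# Inspect the tokenizer_class recorded in the copy written by save_pretrained().
with open("./tokenizer_copy/tokenizer_config.json") as f:
    saved_config = json.load(f)

# Prints "LlamaTokenizer" instead of the original "LlamaTokenizerFast".
print(saved_config["tokenizer_class"])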

Expected behavior

The tokenizer class should stay the same after save_pretrained.

@Qubitium Qubitium added the bug label Jan 22, 2025
@Qubitium
Contributor Author

Qubitium commented Jan 22, 2025

I don't understand why Fast tokenizers get special handling here, with "Fast" stripped from the class name:

# Remove the Fast at the end unless we have a special `PreTrainedTokenizerFast`
if tokenizer_class.endswith("Fast") and tokenizer_class != "PreTrainedTokenizerFast":
    tokenizer_class = tokenizer_class[:-4]

I think the code should be using an isinstance check rather than a string class-name comparison?
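For illustration only, a rough sketch of what an isinstance-based decision could look like (my own sketch, not the patch in #35835; the slow_tokenizer_class / can_save_slow_tokenizer guard is an assumption about how the check could work):

from transformers import PreTrainedTokenizerFast

def config_tokenizer_class(tokenizer) -> str:
    """Sketch: choose the class name to record in tokenizer_config.json via
    isinstance checks instead of stripping the "Fast" suffix from the name."""
    name = type(tokenizer).__name__
    if not isinstance(tokenizer, PreTrainedTokenizerFast):
        return name  # already a slow tokenizer, record it as-is
    # Only down-class to the slow name when a slow counterpart exists and the
    # slow vocab files can actually be saved (assumed guard).
    slow_class = getattr(tokenizer, "slow_tokenizer_class", None)
    if slow_class is not None and getattr(tokenizer, "can_save_slow_tokenizer", False):
        return slow_class.__name__
    return name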

CL-ModelCloud linked a pull request Jan 22, 2025 that will close this issue
@ArthurZucker
Collaborator

Having a look, thanks!

@ArthurZucker
Collaborator

TBH this has not been touched in 4 years, it seems. The main reason is that if you keep the non-fast class name you can still reload the tokenizer, so it does not really make a big difference.

@Qubitium
Contributor Author

Qubitium commented Jan 23, 2025

Duplicating PR notes here:

Some models, like the recently released deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B, do not provide vocab.json but have a native Fast tokenizer class: LlamaTokenizerFast. If this bug is not fixed, loading the tokenizer will break when use_fast=False is applied to the Fast-stripped, down-classed tokenizer. The current code assumes all Fast models include vocab.json, for reasons I do not yet understand.
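For example (illustrative, reusing the copy written in the reproduction above), the slow-path reload then fails because the checkpoint never shipped the slow vocab files:

from transformers import AutoTokenizer

# The re-saved copy now declares "tokenizer_class": "LlamaTokenizer" (slow),
# but no vocab.json / sentencepiece model was ever shipped with the original
# fast-only checkpoint, so forcing the slow path errors out.
tokenizer = AutoTokenizer.from_pretrained("./tokenizer_copy", use_fast=False)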

Full context: we found this bug after using GPTQModel to quantize various DeepSeek models and using EvalPlus to run benchmarks on them. EvalPlus always loads the slow tokenizer by passing use_fast=False. Native models worked fine; quantized models failed. We traced the failure to this tokenizer save() bug. It has nothing to do with quantization, and we did not modify the tokenizer, so this state change is unexpected.

EDIT: Maybe we should just remove this Fast down-cast altogether?
