Add some unit tests for special tokens #6

Open
ClementLokad opened this issue Apr 17, 2024 · 4 comments
Labels: enhancement (New feature or request)

Comments

@ClementLokad
Collaborator

Hello,

Could you please add some unit tests to verify that tokens are encoded differently depending on whether or not they are special? Perhaps by using a SpecialTokenMap that changes the default values of XlmRobertaVocab.

@ClementLokad
Collaborator Author

Also, please unskip the unit test TestCreateObject.

@EslaMx7
Collaborator

EslaMx7 commented Apr 23, 2024

@ClementLokad Can you please provide more details and examples?

@ClementLokad
Collaborator Author

For example, the sentence "Wondering how this will get tokenized 🤔 ?" is not encoded the same way if the following JSON is passed as the third parameter of XLMRobertaTokenizer:

{
  "BosToken": "<s>",
  "ClsToken": "<s>",
  "EosToken": "</s>",
  "MaskToken": "an",
  "PadToken": "<pad>",
  "SepToken": "on",
  "UnkToken": "<unk>"
}

Also, I see that in the Rust library (https://github.com/guillaume-be/rust-tokenizers/blob/main/main/src/vocab/xlm_roberta_vocab.rs#L219), not all of the special token map items (bos_token and mask_token) are added to values, unlike here: https://github.com/Lokad/Tokenizers/blob/master/src/Lokad.Tokenizers/Vocab/XlmRobertaVocab.cs#L66
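
The kind of check requested above can be sketched roughly as follows. This is a minimal illustration in Python with a toy encoder, not the Lokad.Tokenizers API: the `encode` helper, its character-level fallback, and the values in `DEFAULT_SPECIAL` are all assumptions made for the example. It only shows the shape of the assertion: the same sentence should encode differently once "an" and "on" are declared special.

```python
import re

# Assumed default special tokens for this toy example (not read from XlmRobertaVocab).
DEFAULT_SPECIAL = {
    "BosToken": "<s>", "ClsToken": "<s>", "EosToken": "</s>",
    "MaskToken": "<mask>", "PadToken": "<pad>",
    "SepToken": "</s>", "UnkToken": "<unk>",
}

# The custom map from the comment above: "an" and "on" become special tokens.
CUSTOM_SPECIAL = {
    "BosToken": "<s>", "ClsToken": "<s>", "EosToken": "</s>",
    "MaskToken": "an", "PadToken": "<pad>",
    "SepToken": "on", "UnkToken": "<unk>",
}


def encode(text, special_map):
    """Hypothetical toy encoder: special tokens are kept whole,
    all other text falls back to one piece per character."""
    specials = sorted(set(special_map.values()), key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(s) for s in specials) + ")"
    pieces = []
    for chunk in re.split(pattern, text):
        if not chunk:
            continue
        if chunk in specials:
            pieces.append(chunk)   # special token: emitted as a single piece
        else:
            pieces.extend(chunk)   # ordinary text: split into characters
    return pieces


def test_special_tokens_change_encoding():
    sentence = "Wondering how this will get tokenized 🤔 ?"
    default = encode(sentence, DEFAULT_SPECIAL)
    custom = encode(sentence, CUSTOM_SPECIAL)
    # "on" (inside "Wondering") is atomic only under the custom map.
    assert default != custom
    assert "on" in custom
    assert "on" not in default


test_special_tokens_change_encoding()
```

A real test in this repository would presumably build the tokenizer twice, once with the default map and once with the custom one, and compare the resulting token ids in the same way.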

EslaMx7 self-assigned this May 6, 2024
EslaMx7 added the enhancement (New feature or request) label May 6, 2024
@ClementLokad
Collaborator Author

ClementLokad commented May 14, 2024

Hello Eslam,
I hope you are doing well. Did you get time to look at this issue and the other one?
