Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add llama 3 tokenizer #850

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open

Add llama 3 tokenizer #850

wants to merge 3 commits into from

Conversation

sychen52
Copy link
Contributor

No description provided.

@@ -67,7 +67,7 @@ class Version(enum.Enum):
VOCAB_SIZE = {
Version.V1: 32 * 1024,
Version.V2: 32 * 1024,
Version.V3: 128256,
Version.V3: 128 * 1024,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

llama 3 is supposed to be 128256. Are there any plans to stick to standard llama 3?

Copy link
Contributor Author

@sychen52 sychen52 Nov 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I add -tiktoken configs and they will use 128256.
the configs without -tiktoken will stay unchanged. vocab_size is still 128*1024 and using bpe tokenizer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah nice! I didn't catch that part.

@ruomingp ruomingp requested a review from kelvin-zou November 19, 2024 21:20
Copy link
Contributor

@ruomingp ruomingp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Could you confirm with @kelvin-zou that the new configs still work on TPUs and GPUs?

Copy link
Contributor

@kelvin-zou kelvin-zou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this new tokenizer.

Copy link
Contributor

@markblee markblee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor follow-up comments.

@@ -301,6 +305,10 @@ def model_config(
lm_head=lm_head_cfg,
dropout_rate=dropout_rate,
)
if pad_token_id:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since pad_token_id=0 is a valid token:

Suggested change
if pad_token_id:
if pad_token_id is not None:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -301,6 +305,10 @@ def model_config(
lm_head=lm_head_cfg,
dropout_rate=dropout_rate,
)
if pad_token_id:
decoder_cfg.set(pad_token_id=pad_token_id)
if eos_token_id:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if eos_token_id:
if eos_token_id is not None:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

128256: "bpe_128k_c4.model",
}

def _vocab_cfg(size: int):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit -

Suggested change
def _vocab_cfg(size: int):
def _vocab_cfg(vocab_size: int):

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

if size == 128256:
# TikToken.
return config_for_class(FujiV3Vocabulary).set(filename="Llama-3-tokenizer.json")
raise ValueError(f"size {size} tokenizer does not exist.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
raise ValueError(f"size {size} tokenizer does not exist.")
raise ValueError(f"Tokenizer with vocab size {size} does not exist.")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines +449 to +461
pad_token_id: Optional[int] = None,
eos_token_id: Optional[int] = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update docstring accordingly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -423,6 +421,9 @@ def get_trainer_kwargs(
raise NotImplementedError(f"Unknown model size {model_size}.")
model_kwargs = trainer_kwargs.pop("model_kwargs")
model_kwargs.setdefault("vocab_size", vocab_size)
if version == Version.V3 and vocab_size == 128256: # tiktoken tokenizer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use a private const for the tiktoken size?

_TIKTOKEN_VOCAB_SIZE = 128256

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add a new version called V3_tiktoken. I think it will make things cleaner.

vocab_size = VOCAB_SIZE[version]
if tiktoken:
suffix += "-tiktoken"
vocab_size = 128256
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use the const here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

@@ -27,6 +27,65 @@
# Use cpu for the test.
jax.config.update("jax_platform_name", "cpu")

config_dict_1b = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a link to the json config as a reference?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

filename = os.path.join(data_dir, "tokenizers", "hf", filename)
if filename.startswith("gs:") or filename.startswith("s3:"):
# Create a different file for each usage.
tmp = tempfile.mkdtemp()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we cleanup tmp after tokenizer has been initialized?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

model.decoder.transformer.repeat.drop_output.fn: 'axlearn.common.repeat._drop_by_regex'
model.decoder.transformer.repeat.drop_output.rules[0]: 'module_outputs.*'
model.decoder.transformer.repeat.klass: 'axlearn.common.attention._TransformerRepeat'
model.decoder.vocab_size: 128256
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@samos123 for example here it is using tiktoken and vocab_size is 128256.

@@ -301,6 +305,10 @@ def model_config(
lm_head=lm_head_cfg,
dropout_rate=dropout_rate,
)
if pad_token_id:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -301,6 +305,10 @@ def model_config(
lm_head=lm_head_cfg,
dropout_rate=dropout_rate,
)
if pad_token_id:
decoder_cfg.set(pad_token_id=pad_token_id)
if eos_token_id:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

128256: "bpe_128k_c4.model",
}

def _vocab_cfg(size: int):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

if size == 128256:
# TikToken.
return config_for_class(FujiV3Vocabulary).set(filename="Llama-3-tokenizer.json")
raise ValueError(f"size {size} tokenizer does not exist.")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines +449 to +461
pad_token_id: Optional[int] = None,
eos_token_id: Optional[int] = None,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -423,6 +421,9 @@ def get_trainer_kwargs(
raise NotImplementedError(f"Unknown model size {model_size}.")
model_kwargs = trainer_kwargs.pop("model_kwargs")
model_kwargs.setdefault("vocab_size", vocab_size)
if version == Version.V3 and vocab_size == 128256: # tiktoken tokenizer
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add a new version called V3_tiktoken. I think it will make things cleaner.

vocab_size = VOCAB_SIZE[version]
if tiktoken:
suffix += "-tiktoken"
vocab_size = 128256
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

@@ -27,6 +27,65 @@
# Use cpu for the test.
jax.config.update("jax_platform_name", "cpu")

config_dict_1b = {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

filename = os.path.join(data_dir, "tokenizers", "hf", filename)
if filename.startswith("gs:") or filename.startswith("s3:"):
# Create a different file for each usage.
tmp = tempfile.mkdtemp()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@sychen52 sychen52 force-pushed the llama branch 2 times, most recently from 3e8d4f9 to 81f3e7b Compare December 3, 2024 18:40
@sychen52 sychen52 requested a review from a team as a code owner December 25, 2024 16:42
@sychen52 sychen52 force-pushed the llama branch 3 times, most recently from 2b0368d to 0d2b2d7 Compare December 30, 2024 17:29
@sychen52 sychen52 force-pushed the llama branch 9 times, most recently from f6ba43b to d501c67 Compare January 7, 2025 23:40
sychen52 and others added 2 commits January 8, 2025 09:57
add a new version called V3_TIKTOKEN.

other edits based on suggestions.
@sychen52
Copy link
Contributor Author

sychen52 commented Jan 9, 2025

@ruomingp , I have verified with @kelvin-zou that this works on TPU and GPU.
Note that, with this PR, I did not delete any of the old config in case we have issues with the new tokenizer. So people can still use the old config (the ones without -tiktoken) with sentencepiece tokenizer.
In the future, if we are happy with this new tokenizer, we can remove the old config.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants