BPE tokenizer #254

Open
rom1504 opened this issue Sep 11, 2023 · 2 comments

Comments

rom1504 (Collaborator) commented Sep 11, 2023

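It could be fun to have a tokenizer that works like this: take all video frames, apply CLIP to each one, map each embedding to cluster N of 2^17 (the number of clusters I have in my clip retrieval index), apply BPE to the resulting sequence of cluster IDs, and return the token sequence.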

Inspired by https://arxiv.org/abs/2309.04459
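
A minimal sketch of the pipeline described above, under stated assumptions: the CLIP embeddings are taken as already computed (random vectors stand in for them here), the codebook is a tiny random one rather than the 2^17 clusters from the index, and `assign_clusters`, `merge_pair`, `train_bpe`, and `encode` are hypothetical helper names, not part of any existing library. The BPE is the plain greedy variant that repeatedly merges the most frequent adjacent pair of IDs.

```python
from collections import Counter

import numpy as np


def assign_clusters(embeddings, centroids):
    """Map each frame embedding to the index of its most cosine-similar centroid."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return np.argmax(e @ c.T, axis=1).tolist()


def merge_pair(seq, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `seq` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_bpe(sequences, base_vocab, num_merges):
    """Learn up to `num_merges` merges; new token IDs start at `base_vocab`."""
    seqs = [list(s) for s in sequences]
    merges = []
    for _ in range(num_merges):
        counts = Counter()
        for s in seqs:
            counts.update(zip(s, s[1:]))
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:  # no pair repeats, so further merges would not help
            break
        new_id = base_vocab + len(merges)
        merges.append((pair, new_id))
        seqs = [merge_pair(s, pair, new_id) for s in seqs]
    return merges


def encode(seq, merges):
    """Apply the learned merges in training order to one sequence of cluster IDs."""
    seq = list(seq)
    for pair, new_id in merges:
        seq = merge_pair(seq, pair, new_id)
    return seq


# Toy demo: random vectors stand in for real CLIP frame embeddings (assumption).
rng = np.random.default_rng(0)
centroids = rng.normal(size=(16, 8))   # stand-in for the 2^17-entry codebook
frames = rng.normal(size=(200, 8))     # stand-in for per-frame CLIP embeddings
ids = assign_clusters(frames, centroids)
merges = train_bpe([ids], base_vocab=len(centroids), num_merges=32)
tokens = encode(ids, merges)
print(len(ids), "->", len(tokens))
```

Random vectors have none of the temporal redundancy of real video, where consecutive frames would often fall into the same cluster, so the length reduction printed by this demo says little about what to expect in practice.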

rom1504 (Collaborator, Author) commented Sep 11, 2023

I can't find any numbers on how much BPE reduces the sequence length.

rom1504 (Collaborator, Author) commented Sep 11, 2023

But let's say a factor of 3. That would mean encoding a 3600-frame video as a sequence of 1200 tokens.
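
The arithmetic above, as a small sketch. The factor of 3 is the guess from this comment, not a measured value, and `expected_tokens` and `compression_ratio` are hypothetical helper names. The second function shows how the ratio could be measured once a real encoder exists.

```python
import math


def expected_tokens(num_frames, ratio):
    """Estimated sequence length if BPE shortens the sequence by `ratio`."""
    return math.ceil(num_frames / ratio)


def compression_ratio(num_frames, num_tokens):
    """Measured reduction factor: raw frame count over encoded token count."""
    return num_frames / num_tokens


print(expected_tokens(3600, 3))       # 1200
print(compression_ratio(3600, 1200))  # 3.0
```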
