
Fix sequence clipping bug in tokenizer #45

Open
justin-barton opened this issue Sep 9, 2023 · 1 comment
justin-barton commented Sep 9, 2023

Currently, when the tokenizer's batch_encode method is called with both add_special_tokens=True and return_tensors=True, the last two residues are truncated from the longest sequences in the batch. For example:

from protein_lm.tokenizer import AptTokenizer

tokenizer = AptTokenizer()

sequences = ["LAGERT", "SERPK"]

batch_encoded = tokenizer.batch_encode(
    sequences,
    add_special_tokens=True,
    return_tensors=True,
)

print(tokenizer.decode(batch_encoded[0]))

Outputs:

<cls>LAGE<eos>

whereas the expected output is <cls>LAGERT<eos>; the final two residues (RT) have been clipped.
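
For reference, here is a minimal, self-contained sketch of the suspected cause. The real AptTokenizer internals may differ, and the helper names (encode_one, the token strings) are illustrative only. If batch_encode computes the padded width from the raw sequence lengths, and the per-sequence encoder then reserves two slots out of that same budget for the <cls> and <eos> tokens, the longest sequences lose exactly their last two residues:

# Hypothetical reconstruction of the bug; not the actual AptTokenizer code.

def encode_one(seq, max_sequence_length, add_special_tokens=True):
    tokens = list(seq)  # stand-in for per-residue token ids
    if add_special_tokens:
        # Two slots of the length budget are reserved for <cls>/<eos>,
        # silently dropping the last two residues of a max-length sequence.
        tokens = tokens[: max_sequence_length - 2]
        tokens = ["<cls>"] + tokens + ["<eos>"]
    # Pad to the fixed width expected when stacking into a tensor.
    tokens += ["<pad>"] * (max_sequence_length - len(tokens))
    return tokens

def batch_encode(sequences, add_special_tokens=True):
    # Suspected bug: the width is computed from the raw sequences only,
    # with no room for the two special tokens added in encode_one.
    max_sequence_length = max(len(s) for s in sequences)
    return [
        encode_one(s, max_sequence_length, add_special_tokens)
        for s in sequences
    ]

encoded = batch_encode(["LAGERT", "SERPK"])
print("".join(t for t in encoded[0] if t != "<pad>"))  # -> <cls>LAGE<eos>

If this is indeed the cause, one possible fix is to grow the width by two whenever special tokens are requested, e.g. max_sequence_length = max(len(s) for s in sequences) + 2, so that <cls> and <eos> no longer displace residues.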
@justin-barton (Contributor, Author) commented:

/take

@justin-barton justin-barton changed the title Fix length clipping bug in tokenizer Fix sequence clipping bug in tokenizer Sep 9, 2023
@pascalnotin pascalnotin moved this to In Progress in project-lm-scaling Sep 21, 2023