Commit
fix tokenizer to work with new version of transformers library (#208)
Various fixes to make tokenizers work with the latest versions of HF `transformers` and `transformer_lens`.

# commit history

* Try and fix tokenizer to work with new version of transformers library

  The proposed solution is probably not backwards compatible, and is fairly hacky (it strips spaces, and I am not sure it properly assigns vocab / special tokens).

  There is an issue with our tokenization in the new version of transformers. In particular, in the `tokenize` function from `transformers.tokenization_utils.py`, the line `tokens = self.tokens_trie.split(text)` returns a list of tokens with spaces if the input sequence is `<path_start> (1,0)…` (i.e. includes spaces). This wasn't the case before, and I suspect it stems from how I had to change the addition of the vocabulary in our tokenizer to work with their new way of handling token addition via the `_add_tokens` method (we can't just overwrite the dicts, as these are now properties >.<).

  As a temporary fix we can manually remove spaces from sequences, but that's quite disgusting. The best option might be to create token jsons and push a tokenizer to huggingface.
* Updated poetry dependencies. `poetry.lock` now has `transformers 4.38.1`, `transformer-lens 1.14.0` among many other updates.
* Added `self.init_kwargs["add_bos_token"] = True` as an uninformed band-aid. Need to discuss if this makes any sense.
* Tiny fix to `HuggingMazeTokenizer._tokenize` as described in the GitHub comment above. One unit test eliminated, other unit tests and notebook tests pass. A few notebooks are dumping their outputs directly to `notebooks/` instead of a temp directory. Didn't delete them, just for reference by a future fix.
* Unit tests pass, my CPU won't let me run `make test` right now.
* All tests pass
* Updated `black` dependency to match CI version. Reran formatting.
* run formatters
* minor type hint fix
* our special tokens aren't what HF special tokens are
* re-run format??
* improved `test_maze_to_tokens_roundtrip`, added comparison with manually inspected tokenization
* throw exception on an empty space token
* moved tokenizer test to maze-dataset
* format

---------

Co-authored-by: aaron-sandoval <[email protected]>
Co-authored-by: mivanit <[email protected]>
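The space-stripping workaround and the blank-token check described in the commit history can be sketched roughly as follows. This is a minimal, standalone illustration and not the code from this commit: the function names `drop_whitespace_tokens` and `require_no_blank_tokens` are hypothetical, and `raw_pieces` is an assumed example of what a trie split might return for text containing spaces. It does not import or call `transformers`.

```python
from typing import Iterable


def drop_whitespace_tokens(pieces: Iterable[str]) -> list[str]:
    """Strip padding spaces from each piece and discard pieces that end up empty.

    Hypothetical helper illustrating the "manually remove spaces" workaround.
    """
    stripped = (piece.strip() for piece in pieces)
    return [piece for piece in stripped if piece]


def require_no_blank_tokens(tokens: Iterable[str]) -> list[str]:
    """Raise ValueError if any token is empty or whitespace-only.

    Hypothetical helper illustrating the stricter "throw an exception" option.
    """
    checked = list(tokens)
    for index, token in enumerate(checked):
        if token.strip() == "":
            raise ValueError(f"blank token at position {index}: {token!r}")
    return checked


# Assumed shape of a split result for text that contains spaces.
raw_pieces = ["<path_start>", " ", "(1,0)", " ", "(1,1)"]

clean = drop_whitespace_tokens(raw_pieces)
print(clean)
```

Silently dropping blanks keeps old inputs working, while raising makes an unexpected space token visible early; which behavior is appropriate depends on whether callers are expected to pre-clean their sequences.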
1 parent b3417f9, commit 495d8d3
Showing 16 changed files with 3,784 additions and 1,436 deletions.