Fix inconsistent tokenization #2

maharajbrahma · 2023-08-31T06:26:54Z

12.6 should not split
22थी should split
थी22 should split

swaubhik · 2023-09-05T15:48:14Z

import re

def bodo_tokenizer(text):
    # Token pattern, tried left to right: decimal numbers, other digit/dot runs,
    # Devanagari runs (U+0900-U+097F, the script used in the examples below),
    # then any other run of non-whitespace characters.
    pattern = r'(\d+\.\d+)|([\d.]+)|([\u0900-\u097F]+)|(\S+)'

    # Collect every match in order of appearance
    tokens = [match.group(0) for match in re.finditer(pattern, text)]

    return tokens

# Test the tokenizer with some examples
text1 = "12.6 थी22 22थी"
tokens1 = bodo_tokenizer(text1)
print(tokens1)  # Output: ['12.6', 'थी', '22', '22', 'थी']

text2 = "थी 12.6 थी 22थी"
tokens2 = bodo_tokenizer(text2)
print(tokens2)  # Output: ['थी', '12.6', 'थी', '22', 'थी']

swaubhik · 2023-09-11T09:43:03Z

12,600 this should not split
21,थी this should split
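
One possible way to cover all five cases raised in this thread (12.6, 22थी, थी22, 12,600 and 21,थी) is sketched below. It is not the code from the linked pull request. The function name `tokenize_sketch` is hypothetical, and two points are assumptions rather than anything stated above: that "21,थी" should come out as three tokens with the comma on its own, and that the Devanagari letter range should leave out the danda marks and the Devanagari digits.

```python
import re

# Hypothetical sketch, not the project's tokenizer.
TOKEN_RE = re.compile(
    # Number: digits, optionally followed by groups of "." or "," plus digits,
    # so "12.6" and "12,600" stay whole while the comma in "21,थी" is left out.
    r"\d+(?:[.,]\d+)*"
    # Devanagari letters and signs. Assumption: U+0964-U+0971 (danda marks,
    # Devanagari digits, abbreviation signs) are excluded so they split off.
    r"|[\u0900-\u0963\u0972-\u097F]+"
    # Runs of letters from any other script
    r"|[^\W\d_]+"
    # Anything else (punctuation, symbols) as a single-character token
    r"|\S"
)


def tokenize_sketch(text):
    """Return the tokens of text in order of appearance."""
    return TOKEN_RE.findall(text)


if __name__ == "__main__":
    for sample in ["12.6", "22थी", "थी22", "12,600", "21,थी"]:
        print(sample, tokenize_sketch(sample))
```

Because the number alternative only accepts a "." or "," when digits follow it, a trailing period or comma is emitted as its own token instead of being glued to the number.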

maharajbrahma assigned swaubhik Aug 31, 2023

maharajbrahma added the bug label Aug 31, 2023

swaubhik linked a pull request Sep 5, 2023 that will close this issue

🚧 WIP: checking new tockeniser #3

Closed

swaubhik pinned this issue Sep 5, 2023
