REGEX used to split accented characters #11

Open
sergiomajluf opened this issue Feb 8, 2019 · 0 comments

As always, thanks for sharing all this code!

While using it with a Spanish text, I found that it didn't work because of special characters. In concordance.js, this code splits the text into an array of tokens:

// Splitting up the text
        split(text) {
            // Split into array of tokens
            return text.split(/\W+/);
        }

Unfortunately, \W also matches accented (diacritic) characters, because in JavaScript \w only covers A-Z, a-z, 0-9 and the underscore. So words like "selección" and "niño" get chopped into "selecci", "n", "ni", "o" by that regex.
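To make the problem concrete, here is a minimal sketch that reproduces the split and shows one possible Unicode-aware alternative. It assumes a JavaScript engine that supports ES2018 Unicode property escapes (`\p{L}`, `\p{N}` with the `u` flag); `splitUnicode` is a hypothetical helper name, not something in concordance.js.

```javascript
// Hypothetical helper, not part of concordance.js.
// Splits on runs of characters that are not letters, digits or underscore,
// where "letter" means any Unicode letter rather than only A-Z.
function splitUnicode(text) {
  return text
    .normalize('NFC') // compose "o" + combining accent into the single letter "ó"
    .split(/[^\p{L}\p{N}_]+/u)
    .filter((token) => token.length > 0); // drop empty strings at the edges
}

const sample = 'selección niño';

const asciiSplit = sample.split(/\W+/);
const unicodeSplit = splitUnicode(sample);

console.log(asciiSplit);   // [ 'selecci', 'n', 'ni', 'o' ]
console.log(unicodeSplit); // [ 'selección', 'niño' ]
```

The `normalize('NFC')` call is a precaution: if the input stores accents as separate combining marks, they would not count as letters and the word would still be cut.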

I found a workaround, using match instead of split:

var re = /\S+\s*/g;
tokens = allwords.match(re);

This of course required me to also change the previous code a bit, so that the lines are joined into a single string:

txt = loadStrings('preguntas/todas.txt');
allwords = txt.join("\n");
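A quick check of what the `match` pattern returns may be useful. The sketch below uses a made-up string in place of the joined file contents, and adds a `|| []` guard, since `match` with a global regex returns `null` when nothing matches.

```javascript
// Made-up sample standing in for the joined file contents.
const allwords = 'Hola (mundo).\nAdiós niño';

// Each token is a run of non-whitespace plus any whitespace that follows it.
const tokens = allwords.match(/\S+\s*/g) || [];
console.log(tokens); // [ 'Hola ', '(mundo).\n', 'Adiós ', 'niño' ]

// Without the guard this would be null and tokens.length would throw.
const emptyTokens = ''.match(/\S+\s*/g) || [];
console.log(emptyTokens.length); // 0
```

Accented words stay intact, but each token keeps its punctuation and trailing whitespace.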

My proposed solution is not very good either: after splitting into tokens, I still had to clean up many other non-alphanumeric characters, whitespace, line breaks, etc. by hand. I did that while "sanitizing" each word before adding it to keys and counts. For example:

for (var i = 0; i < tokens.length; i++) {
    var word = tokens[i].toLowerCase();

    // Clean some more. A string pattern only replaces the first match,
    // so use global regexes to strip every occurrence.
    word = word.replace(/\(/g, "");
    word = word.replace(/\)/g, "");
    word = word.replace(/\./g, "");
    word = word.replace(finBlanco, "");              // finBlanco is defined elsewhere (not shown)
    word = word.replace(/(\r\n|\n|\t|\r)/gm, "");

    if (!/\d+/.test(word)) {                         // contains no digits
        if (sw.indexOf(word) == -1) {                // is not a stop word within a custom sw array
            if (counts[word] === undefined) {        // is a new word
                counts[word] = 1;
                keys.push(word);
            } else {
                counts[word]++;
            }
        }
    }
}
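
As a possible way to cut down on the hand cleaning, the loop above could strip every non-letter, non-digit character from both ends of a token in one step. This is only a sketch under assumptions: it needs ES2018 Unicode property escapes, `countWords` and its parameters are hypothetical names, and it counts in a `Map` rather than a plain object so that a word such as "constructor" cannot collide with an inherited property.

```javascript
// Hypothetical helper; names are illustrative, not taken from concordance.js.
function countWords(tokens, stopWords) {
  const counts = new Map();
  for (const token of tokens) {
    // Normalize, lowercase, then trim non-letter/non-digit characters
    // (punctuation, brackets, whitespace, line breaks) from both ends.
    const word = token
      .normalize('NFC')
      .toLowerCase()
      .replace(/^[^\p{L}\p{N}]+|[^\p{L}\p{N}]+$/gu, '');

    // Skip empty results, tokens containing digits, and stop words.
    if (word === '' || /\d/.test(word) || stopWords.has(word)) {
      continue;
    }
    counts.set(word, (counts.get(word) || 0) + 1);
  }
  return counts;
}

// Made-up sample text, tokenized with the match() workaround from this issue.
const sampleText = '¿Qué (niño) ganó la selección? El niño.\n';
const sampleTokens = sampleText.match(/\S+\s*/g) || [];
const wordCounts = countWords(sampleTokens, new Set(['la', 'el']));

console.log(wordCounts.get('niño')); // 2
console.log([...wordCounts.keys()]); // [ 'qué', 'niño', 'ganó', 'selección' ]
```

A `Map` keeps insertion order, so a separate keys array is not needed in this version.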