While using it with a Spanish text, I found that it didn't work because of special characters. In concordance.js, this code splits the text into an array of tokens:
// Splitting up the text
split(text) {
  // Split into array of tokens
  return text.split(/\W+/);
}
Unfortunately, \W in JavaScript treats accented (diacritic) characters as non-word characters, because \w only covers ASCII letters, digits and the underscore. So words like "selección" and "niño" get chopped into "selecci", "n", "ni", "o" by that regex.
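To make the problem concrete, here is a minimal sketch that reproduces the chopping and shows one possible alternative. It assumes a runtime that supports Unicode property escapes (`\p{L}`, `\p{N}` with the `u` flag, ES2018). The helper name `splitUnicode` is hypothetical and not part of concordance.js.

```javascript
// Reproduce the issue: \W matches "ó" and "ñ", so the words are cut in two.
const broken = "selección niño".split(/\W+/);
// broken is ["selecci", "n", "ni", "o"]

// Hypothetical helper (an assumption, not from concordance.js):
// split on runs of characters that are neither Unicode letters nor digits.
function splitUnicode(text) {
  return text
    .normalize("NFC")            // compose "n" + combining tilde into a single "ñ"
    .split(/[^\p{L}\p{N}]+/u)    // Unicode-aware replacement for /\W+/
    .filter(Boolean);            // drop empty strings from leading/trailing separators
}

const fixed = splitUnicode("La selección del niño.");
// fixed is ["La", "selección", "del", "niño"]
```

Unlike `\W`, this class also treats the underscore as a separator, which may or may not matter for a given text.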
I found a workaround: using match instead of split.
var re = /\S+\s*/g;
tokens = allwords.match(re);
This of course required me to also change the previous code a bit, into
My proposed solution is not very good either: after splitting into tokens, I still had to clean up many other non-alphanumeric characters, whitespace, line breaks, etc. by hand. That cleanup happens while "sanitizing" each word before adding it to keys and counts. For example:
for (var i = 0; i < tokens.length; i++) {
  var word = tokens[i].toLowerCase();
  // Clean some more
  word = word.replace("(", "");
  word = word.replace(")", "");
  word = word.replace(".", "");
  word = word.replace(finBlanco, "");
  word = word.replace(/(\r\n|\n|\t|\r)/gm, "");
  if (!/\d+/.test(word)) { // is not a number
    if (sw.indexOf(word) == -1) { // is not a stop word within a custom sw array
      if (counts[word] === undefined) { // is a new word
        counts[word] = 1;
        keys.push(word);
      } else {
        counts[word]++;
      }
    }
  }
}
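For comparison, here is a rough sketch of how most of that hand cleaning might be avoided by matching runs of letters directly, so punctuation, digits and whitespace never end up inside a token. This is only an illustration under assumptions: `countWords` and `stopWords` are hypothetical names, and the runtime needs Unicode property escapes (ES2018). It is not a drop-in for the loop above: a token like "abc123" yields "abc" here instead of being skipped, and hyphenated words are split in two.

```javascript
// Hypothetical helper (an assumption, not from concordance.js).
function countWords(text, stopWords = []) {
  const stop = new Set(stopWords.map((w) => w.toLowerCase()));
  const counts = Object.create(null); // no prototype, so "constructor" is a safe key
  const keys = [];

  // Each match is a run of letters and combining marks.
  const tokens = text.normalize("NFC").toLowerCase().match(/[\p{L}\p{M}]+/gu) || [];

  for (const word of tokens) {
    if (stop.has(word)) continue;     // skip stop words
    if (counts[word] === undefined) { // new word
      counts[word] = 1;
      keys.push(word);
    } else {
      counts[word]++;
    }
  }
  return { counts, keys };
}

const result = countWords("El niño (feliz) y el NIÑO. 2024\n", ["el", "y"]);
// result.keys is ["niño", "feliz"]; result.counts["niño"] is 2
```

A `Set` is used for the stop words so each lookup does not scan the whole list the way `indexOf` does.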
As always, thanks for sharing all this code!