Probability normalization #89
That difference is exactly what you can use to check how sure franc is. For example, if you’re checking whether documents are probably in English, you could see if the score for it is
The raw score depends on how long the passed-in value is, so it’s not very interesting.
That doesn't really work, though. For example, for this 2048-character text: Sample text
I get these languages with probability > 0.85: Detected probabilities
I mean, if 129 out of 180 supported languages are considered probable, that's not very useful.
Is 0.1 a big enough threshold though? It's a bit hard for me to gauge how big a difference 0.1 makes when 120 languages are above 0.85 anyway 🤔
I could normalize them myself, but then the top language would end up at maybe 1%, which doesn't tell me much. The top percentage should be much higher for long, not-very-ambiguous documents like the one I'm feeding it in the example above.
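To illustrate the point about naive normalization, here is a minimal sketch. It assumes the scores are available as `[code, score]` pairs; `normalizeScores` is a hypothetical helper, not part of franc, and the input data is made up to mimic 120 languages scoring between 0.85 and 1.

```javascript
// Hypothetical helper (not part of franc): rescale [code, score] pairs
// so that the scores sum to 1.
function normalizeScores(pairs) {
  const total = pairs.reduce((sum, [, score]) => sum + score, 0);
  // Avoid dividing by zero when every score is 0 or the list is empty.
  if (total === 0) return pairs.map(([code]) => [code, 0]);
  return pairs.map(([code, score]) => [code, score / total]);
}

// Made-up input: 120 languages with scores spread evenly from 1 down to 0.85.
const example = Array.from({length: 120}, (_, i) => [
  `lang${i}`,
  1 - (0.15 * i) / 119,
]);

const normalized = normalizeScores(example);
console.log(normalized[0]); // the top language ends up with a share of about 0.009
```

With scores this tightly packed, dividing by the sum leaves the top language at roughly 1%, which matches the concern above: the result adds up to 1 but says little about how sure the detection is.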
Can it? I don't know, because looking at the probabilities there are like 30 languages detected within a 0.005 range, and I'm not sure how I'm supposed to gauge the sureness of the model on those languages. Even the difference between English and Spanish is only about 0.031.
I believe whatlang-rs, which is inspired by franc, does some smart things here: https://github.com/greyblake/whatlang-rs#how-does-it-work
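The "difference between scores" idea discussed here can be sketched as a margin between the best and second-best score. This is only an illustration: `topMargin` is a hypothetical helper, the input is assumed to be `[code, score]` pairs, and the sample values are made up (only the 0.031 gap is taken from the comment above).

```javascript
// Hypothetical helper (not part of franc): difference between the best
// and second-best score in a list of [code, score] pairs.
function topMargin(pairs) {
  // Copy before sorting so the caller's array is left untouched.
  const sorted = [...pairs].sort((a, b) => b[1] - a[1]);
  if (sorted.length === 0) return 0;
  if (sorted.length === 1) return sorted[0][1];
  return sorted[0][1] - sorted[1][1];
}

// Made-up scores for illustration only.
const sample = [
  ['spa', 0.969],
  ['eng', 1],
  ['fra', 0.95],
];

console.log(topMargin(sample)); // gap between 'eng' and 'spa', about 0.031
```

Whether a given margin is large enough is the open question in this thread; the sketch only shows how to compute it, not which threshold to pick.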
Currently, franc often returns a probability close to 1 for many languages. IMO all these probabilities should be normalized to add up to 1.
Also, there always seems to be a language at the top with probability 1. This makes it difficult to judge how sure the "model" is about the detection, which would be another interesting data point to have.
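One possible way to get values that add up to 1 while still separating the top language from the rest is a softmax with a small temperature. This is a swapped-in technique for illustration, not something franc is known to do; `softmaxScores` and the `temperature` default are assumptions, and the sample scores are made up.

```javascript
// Hypothetical helper (not part of franc): softmax over [code, score] pairs.
// A smaller temperature exaggerates small gaps between scores.
function softmaxScores(pairs, temperature = 0.01) {
  // Subtract the maximum score before exponentiating for numerical stability.
  const max = Math.max(...pairs.map(([, score]) => score));
  const weights = pairs.map(([code, score]) => [
    code,
    Math.exp((score - max) / temperature),
  ]);
  const total = weights.reduce((sum, [, weight]) => sum + weight, 0);
  return weights.map(([code, weight]) => [code, weight / total]);
}

// Made-up scores for illustration only.
const scores = [
  ['eng', 1],
  ['spa', 0.969],
  ['fra', 0.9],
];

console.log(softmaxScores(scores)); // 'eng' receives most of the probability mass
```

The temperature is a tuning knob with no principled value here, so the resulting numbers are better read as a relative ranking with exaggerated gaps than as calibrated probabilities.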