Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

valid input encoded as windows-1252 is detected with nil encoding #17

Closed
coledot opened this issue Nov 25, 2014 · 4 comments
Closed

valid input encoded as windows-1252 is detected with nil encoding #17

coledot opened this issue Nov 25, 2014 · 4 comments

Comments

@coledot
Copy link
Contributor

coledot commented Nov 25, 2014

I have an input file encoded in windows-1252 with the following (anonymized) contents:
1234567890,ASDF,JKL,123 WHEREVER AVE,SOMEWHERE TOWN,UBERLâNDIA,[email protected]
The case has been preserved; note the lowercased accented 'â'.
When running the contents of this file through CharDet.detect, I get back a nil encoding with 0 confidence. Running the same input through uchardet and chardetect (command line wrappers for the C++ and Python implementations of chardet, respectively) both report windows-1252.
I understand mixed case like this might be considered "weird" input, but getting a nil back as the encoding doesn't sound right to me. At the very least, rchardet appears to be doing something inconsistent with the other implementations.

@coledot
Copy link
Contributor Author

coledot commented Nov 25, 2014

Aha! The problem is here, this line in latin1prober.rb:
confidence = (@freqCounter[3] / total) - (@freqCounter[1] * 20.0 / total)
@freqCounter[3] and total are both Fixnum and will result in an integer division. It's supposed to be computing a float value. @freqCounter[3].to_f should fix that up nicely.
Compare this behavior to uchardet; the equivalent calculation there uses a cast to float: mFreqCounter[3]*1.0f / total

@coledot
Copy link
Contributor Author

coledot commented Nov 26, 2014

Fixed in this pull request: #18 (and added a spec)

@coledot
Copy link
Contributor Author

coledot commented Dec 3, 2014

@grosser would love to see this merged in, if possible :)

@grosser
Copy link
Collaborator

grosser commented Dec 3, 2014

taking discussion to the PR

@grosser grosser closed this as completed Dec 3, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants