I have an input file encoded in windows-1252 with the following (anonymized) contents:

1234567890,ASDF,JKL,123 WHEREVER AVE,SOMEWHERE TOWN,UBERLâNDIA,[email protected]
The case has been preserved; note the lowercased accented 'â'.
When I run the contents of this file through CharDet.detect, I get back a nil encoding with 0 confidence. For the same input, uchardet and chardetect (command line wrappers for the C++ and Python implementations of chardet, respectively) both report windows-1252.
I understand mixed case like this might be considered "weird" input, but getting a nil back as the encoding doesn't sound right to me. At the very least, rchardet appears to be doing something inconsistent with the other implementations.
Aha! The problem is this line in latin1prober.rb:

    confidence = (@freqCounter[3] / total) - (@freqCounter[1] * 20.0 / total)

@freqCounter[3] and total are both Fixnum, so the first division is an integer division: it truncates, and yields 0 whenever @freqCounter[3] is smaller than total. It's supposed to be computing a float value. Using @freqCounter[3].to_f should fix that up nicely.
Compare this behavior to uchardet; the equivalent calculation there forces float arithmetic by multiplying by a float constant: mFreqCounter[3]*1.0f / total
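To make the difference concrete, here is a minimal Ruby sketch of the two forms of the calculation. The method names and the counts are hypothetical, picked only to illustrate the arithmetic; this is not the actual prober code, just the same expression with and without the to_f conversion.

```ruby
# Hypothetical stand-ins for @freqCounter and total in latin1prober.rb.
# The values are made up purely to show the arithmetic.
FREQ_COUNTER = [0, 0, 0, 45].freeze
TOTAL = 50

# Same shape as the current expression: Integer / Integer truncates.
def integer_confidence(freq_counter, total)
  (freq_counter[3] / total) - (freq_counter[1] * 20.0 / total)
end

# Proposed fix: convert to Float before dividing.
def float_confidence(freq_counter, total)
  (freq_counter[3].to_f / total) - (freq_counter[1] * 20.0 / total)
end

puts integer_confidence(FREQ_COUNTER, TOTAL) # first term is 45 / 50 == 0
puts float_confidence(FREQ_COUNTER, TOTAL)   # first term is 45.0 / 50
```

With these made-up counts the integer form loses the whole first term, while the float form keeps it.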