Better documentation for //IGNORE with iconv #3
Comments
Wouldn't it be the same as |
@Des-Nerger No, |
This can now be achieved in recode 3.7.11 with iconv, using |
Thanks for the update. Could you give an example of the new recode usage? And are you sure that it skips invalid byte sequences, and not just unknown characters? (In other words the 'invalid input' error would never occur, even given UTF-8 input that had occasional junk bytes mixed in.) I ask because on my reading of the iconv documentation,
it seems to be about after the input byte sequence has been decoded into characters -- while this feature request is for recode to do some kind of retrying when it encounters errors in decoding the raw bytes. |
@epa, probably the best way to approach this is to try with |
Thanks, it appears that iconv
Now we deliberately add a junk byte at the start:
But with
although I don't understand why it reports the error at position 6 rather than position 0. It appears that |
Thanks very much for your feedback! Good news that at least |
Thanks for the update. Testing recode 3.7.11, it appears to pass through the junk byte unchanged, generating invalid UTF-8:
That looks like a bug to me: no matter what the input, if recode is asked to produce UTF-8 output then it must produce valid UTF-8 or die trying. The original feature request was to skip the garbage bytes somehow and make a best effort to produce some valid output despite them, because older versions of recode were strict and would die on invalid byte sequences, but it appears that recode has gone too far in the other direction.

As for the documentation of iconv, it's difficult for me to say because I consider that the behaviour of iconv

If your program is meant to receive UTF-8 input, and it then needs to convert that to Latin-1 for output to some old printer (for example), then you might want to skip and continue if there are input characters you can't handle. However, you'd still want to die with a useful message if the input just isn't valid UTF-8.

The other way round, I believe the original motivation for this feature request was scraping text from websites. The website might not be very well programmed and might mix bits of other encodings with its normal UTF-8 text, giving essentially indecipherable junk bytes sprinkled through the text. I wanted my program to be robust to those, however that doesn't necessarily mean that I wanted to forgo the check of legal characters when converting to my final output encoding.

I think iconv (and recode) should have a flag to handle badly encoded input, as well as possible, and this flag is independent of the chosen target encoding. It could be given with some special syntax on the input encoding, but in my opinion the names of character encodings are confusing enough already, so I'd prefer an entirely separate flag.

Then it can have a way to silently drop characters which can't be represented in the chosen output encoding. Personally I'd like a separate flag for that too, it seems more user-friendly ("Bad character xyz in output; use --skip-unencodable to suppress this error") but it could also be done with a string appended to the name of the encoding.
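To make the distinction concrete, here is a rough Python sketch of the two independent switches described above. It is not how recode or iconv work internally; the function name `convert` and the keyword arguments `ignore_invalid` and `skip_unencodable` are hypothetical, loosely modelled on the flag names proposed in this thread. It only leans on Python's built-in codec error handlers to show that "bad bytes on input" and "unrepresentable characters on output" are separate decisions.

```python
def convert(data: bytes, source: str, target: str, *,
            ignore_invalid: bool = False,
            skip_unencodable: bool = False) -> bytes:
    """Convert bytes between encodings with two independent leniency flags.

    Hypothetical helper for illustration only; not a recode or iconv API.
    """
    # Stage 1: raw bytes -> characters. Only ignore_invalid affects this stage.
    text = data.decode(source, errors="ignore" if ignore_invalid else "strict")
    # Stage 2: characters -> target bytes. Only skip_unencodable affects this stage.
    return text.encode(target, errors="ignore" if skip_unencodable else "strict")
```

With this split, a scraper could pass `ignore_invalid=True` alone and still get an error if the cleaned text contains a character the target encoding cannot represent, while the printer case could pass `skip_unencodable=True` alone and still fail loudly on input that is not valid UTF-8.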
(copied from pinard#14)
Sometimes recode dies with 'Invalid input'. An --ignore-invalid flag would do whatever is needed to skip over junk bytes in the input, recovering whatever valid text can be found. Of course, there is more than one way to decide what to skip when decoding a multibyte encoding, so it would have to pick something broadly sensible.
I'm not envisaging a fully specified decoding for all possible junk input sequences in all possible encodings, just a best effort to extract whatever usable text remains. For UTF-8, having just read an invalid byte sequence, it could discard the first byte of the sequence and try again.
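The "discard the first byte and try again" idea for UTF-8 could look something like the following Python sketch. The function name `decode_utf8_skipping_junk` is hypothetical, and this is only one of several reasonable recovery strategies, not a description of what recode does. It relies on `UnicodeDecodeError.start`, which gives the offset of the first byte of the sequence that failed to decode.

```python
from typing import Tuple


def decode_utf8_skipping_junk(data: bytes) -> Tuple[str, int]:
    """Decode UTF-8, dropping one byte and retrying after each decode error.

    Returns the recovered text and the number of bytes discarded.
    Hypothetical helper for illustration only.
    """
    pieces = []
    skipped = 0
    pos = 0
    while pos < len(data):
        try:
            # Try to decode everything that remains in one go.
            pieces.append(data[pos:].decode("utf-8"))
            break
        except UnicodeDecodeError as err:
            # err.start is relative to the slice, so shift it by pos.
            bad = pos + err.start
            # Everything before the bad byte is known to be valid UTF-8.
            pieces.append(data[pos:bad].decode("utf-8"))
            # Discard only the first byte of the invalid sequence, then retry.
            skipped += 1
            pos = bad + 1
    return "".join(pieces), skipped
```

Returning the count of discarded bytes lets a caller warn the user that the output is a best-effort recovery rather than a faithful conversion.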