Recode produces invalid UTF-8 given invalid UTF-8 #37

rrthomas · 2022-02-04T13:47:57Z

See #3. Recode versions 3.6, 3.7.9 and 3.7.11 all produce the same invalid output given invalid input:

% perl -C0 -E 'say chr(128), chr(65), chr(206), chr(177), chr(66)' | recode UTF-8..UTF-8
\200AαB

Since the behaviour is clearly not new it will require some study to see why it behaves as it does (is it a long-standing bug? or deliberate? or a deep-seated design problem?).

The text was updated successfully, but these errors were encountered:

epa · 2022-02-04T14:08:01Z

Strangely though, the bug doesn't happen with -ignore. That has the side effect of making the input checked for bad byte sequences, and they are skipped.

perl -C0 -E 'say chr(128), chr(65), chr(206), chr(177), chr(66)' | ./src/recode UTF-8..UTF-8-ignore
AαB
/me/recode-3.7.11/src/.libs/lt-recode: Invalid input in step `UTF-8..UTF8-ignore'

rrthomas · 2022-02-15T21:02:52Z

@epa, please could you let me know if this now works for you as advertised, and with documentation that makes sense? (Issue #37 remains to be dealt with.)

epa · 2022-02-16T11:50:53Z

Hi, thanks for fixing this, sorry I can't test it yet because the current git version does not build for me on RHEL 8.5. help2man: no locale support (Locale::gettext required) although I do have that Perl module.

rrthomas · 2022-02-16T21:11:04Z

The problem is most likely that help2man needs to be built with gettext support (this is documented!). Merely having the perl module installed is insufficient.

epa · 2022-02-17T08:22:50Z

I'm testing the current version (62b996d). I'm not sure it works yet:

% perl -C0 -E 'say chr(128), chr(65), chr(206), chr(177), chr(66)' >in
% ./src/recode UTF-8..UTF-8 <in
[produces invalid UTF-8]
% ./src/recode --strict UTF-8..UTF-8 <in
[still produces invalid UTF-8]
% ./src/recode --force UTF-8..UTF-8 <in
[the same again]
% ./src/recode UTF-8..UTF-8-ignore <in
lt-recode: Request `UTF-8..UTF-8-ignore' is erroneous

In my view if recode is asked to produce UTF-8 output, it should always produce UTF-8 and never junk bytes -- and this is such a basic requirement that it shouldn't depend on any force or strict flags.

I think that if the input is specified as UTF-8 then recode should check that, and die if the input is not valid UTF-8 -- but it can be useful to have a lax mode where junk bytes in the input are skipped as best you can.

rrthomas · 2022-02-17T09:58:52Z

I'm really sorry, @epa, I commented on the wrong bug above. I have not yet tried to fix this issue, I was asking for your comments on issue #38.

rrthomas · 2022-02-17T11:51:14Z

In my view if recode is asked to produce UTF-8 output, it should always produce UTF-8 and never junk bytes -- and this is such a basic requirement that it shouldn't depend on any force or strict flags.

The reason that this does not happen currently is because of recode's conversion optimization. When you request a conversion utf-8..utf-8, as you will see if you use --verbose, this is reduced to a zero-step conversion, a "mere copy". Hence, no validation of the input is performed. I think this is the only case in which invalid output is produced.

If instead a conversion is forced, by e.g.

$ recode UTF-8..UCS-4..UTF-8 <in

Then the input is validated and the problem is found (and so no invalid output is produced).

Also, I notice that with --verbose recode 3.7, unlike recode 3.6, prints the request out twice. I shall fix that.

epa · 2022-02-17T12:15:27Z

Yes, I suspected it might be something like that. But surely the main reason why a user would run recode UTF-8..UTF-8 rather than just cat is to get the extra validation that recode performs. So I think whatever optimization is done should not disable the error checks that prevent bad output.

rrthomas · 2022-02-17T14:10:48Z

It's tricky. First, recode might be invoked with UTF-8..UTF-8 as the result of some other computation, not because the user specifically requested it. Secondly, while recode is optimizing a conversion, it might generate this sort of step internally, and there's no reason to keep it.

Internally, recode has no validation of input or output that is separate from a transformation, unfortunately.

rrthomas mentioned this issue Feb 4, 2022

Better documentation for //IGNORE with iconv #3

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Recode produces invalid UTF-8 given invalid UTF-8 #37

Recode produces invalid UTF-8 given invalid UTF-8 #37

rrthomas commented Feb 4, 2022

epa commented Feb 4, 2022

rrthomas commented Feb 15, 2022

epa commented Feb 16, 2022

rrthomas commented Feb 16, 2022

epa commented Feb 17, 2022 •

edited

Loading

rrthomas commented Feb 17, 2022

rrthomas commented Feb 17, 2022

epa commented Feb 17, 2022

rrthomas commented Feb 17, 2022

Recode produces invalid UTF-8 given invalid UTF-8 #37

Recode produces invalid UTF-8 given invalid UTF-8 #37

Comments

rrthomas commented Feb 4, 2022

epa commented Feb 4, 2022

rrthomas commented Feb 15, 2022

epa commented Feb 16, 2022

rrthomas commented Feb 16, 2022

epa commented Feb 17, 2022 • edited Loading

rrthomas commented Feb 17, 2022

rrthomas commented Feb 17, 2022

epa commented Feb 17, 2022

rrthomas commented Feb 17, 2022

epa commented Feb 17, 2022 •

edited

Loading