Recode produces invalid UTF-8 given invalid UTF-8 #37

Open
rrthomas opened this issue Feb 4, 2022 · 9 comments

@rrthomas
Owner

rrthomas commented Feb 4, 2022

See #3. Recode versions 3.6, 3.7.9 and 3.7.11 all produce the same invalid output given invalid input:

% perl -C0 -E 'say chr(128), chr(65), chr(206), chr(177), chr(66)' | recode UTF-8..UTF-8
\200AαB

Since the behaviour is clearly not new it will require some study to see why it behaves as it does (is it a long-standing bug? or deliberate? or a deep-seated design problem?).
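
To make the failure concrete, here is a minimal Python sketch (not part of recode) that rebuilds the bytes emitted by the perl one-liner above and reports where UTF-8 decoding first breaks. The helper name `first_invalid_offset` is made up for illustration.

```python
# The six bytes the perl one-liner emits ("say" appends a newline).
data = bytes([128, 65, 206, 177, 66, 10])


def first_invalid_offset(buf: bytes):
    """Return the offset of the first byte that breaks UTF-8 decoding, or None."""
    try:
        buf.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


print(first_invalid_offset(data))      # 0: the stray 0x80 continuation byte
print(first_invalid_offset(data[1:]))  # None: the rest, "AαB\n", is valid
```

Byte 0x80 (octal \200) is a continuation byte with no lead byte before it, which is why the output shown above is not valid UTF-8.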

@epa

epa commented Feb 4, 2022

Strangely, though, the bug doesn't happen with -ignore. That has the side effect of checking the input for bad byte sequences, which are then skipped.

perl -C0 -E 'say chr(128), chr(65), chr(206), chr(177), chr(66)' | ./src/recode UTF-8..UTF-8-ignore
AαB
/me/recode-3.7.11/src/.libs/lt-recode: Invalid input in step `UTF-8..UTF8-ignore'

@rrthomas
Owner Author

@epa, please could you let me know if this now works for you as advertised, and with documentation that makes sense? (Issue #37 remains to be dealt with.)

@epa

epa commented Feb 16, 2022

Hi, thanks for fixing this. Sorry, I can't test it yet because the current git version does not build for me on RHEL 8.5. It fails with "help2man: no locale support (Locale::gettext required)", although I do have that Perl module.

@rrthomas
Owner Author

The problem is most likely that help2man needs to be built with gettext support (this is documented!). Merely having the perl module installed is insufficient.

@epa

epa commented Feb 17, 2022

I'm testing the current version (62b996d). I'm not sure it works yet:

% perl -C0 -E 'say chr(128), chr(65), chr(206), chr(177), chr(66)' >in
% ./src/recode UTF-8..UTF-8 <in
[produces invalid UTF-8]
% ./src/recode --strict UTF-8..UTF-8 <in
[still produces invalid UTF-8]
% ./src/recode --force UTF-8..UTF-8 <in
[the same again]
% ./src/recode UTF-8..UTF-8-ignore <in
lt-recode: Request `UTF-8..UTF-8-ignore' is erroneous

In my view if recode is asked to produce UTF-8 output, it should always produce UTF-8 and never junk bytes -- and this is such a basic requirement that it shouldn't depend on any force or strict flags.

I think that if the input is specified as UTF-8 then recode should check that, and die if the input is not valid UTF-8 -- but it can be useful to have a lax mode where junk bytes in the input are skipped as best you can.
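
The two behaviours proposed here (fail on invalid input by default, with an optional lax mode that skips junk bytes) can be sketched in Python as follows. This only illustrates the requested semantics; it is not how recode is implemented, and the function name `recode_utf8` and its `lax` parameter are hypothetical.

```python
def recode_utf8(buf: bytes, lax: bool = False) -> bytes:
    """Return valid UTF-8, or raise UnicodeDecodeError if the input is invalid.

    With lax=True, undecodable bytes are dropped instead of raising.
    """
    errors = "ignore" if lax else "strict"
    # Decoding validates the input; re-encoding always yields valid UTF-8.
    return buf.decode("utf-8", errors=errors).encode("utf-8")


data = bytes([128, 65, 206, 177, 66, 10])

print(recode_utf8(data, lax=True))  # the stray 0x80 is skipped

try:
    recode_utf8(data)
except UnicodeDecodeError as exc:
    print(f"invalid input at byte offset {exc.start}")
```

Either way, the output is never invalid UTF-8: the strict call produces no output at all, and the lax call produces only the bytes that decoded cleanly.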

@rrthomas
Owner Author

I'm really sorry, @epa, I commented on the wrong bug above. I have not yet tried to fix this issue, I was asking for your comments on issue #38.

@rrthomas
Owner Author

> In my view if recode is asked to produce UTF-8 output, it should always produce UTF-8 and never junk bytes -- and this is such a basic requirement that it shouldn't depend on any force or strict flags.

This does not happen currently because of recode's conversion optimization. When you request a conversion utf-8..utf-8, as you will see if you use --verbose, this is reduced to a zero-step conversion, a "mere copy". Hence, no validation of the input is performed. I think this is the only case in which invalid output is produced.

If instead a conversion is forced, e.g. by

$ recode UTF-8..UCS-4..UTF-8 <in

then the input is validated and the problem is found (and so no invalid output is produced).
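
The difference between the zero-step copy and the forced round trip can be modelled with a toy pipeline in Python. This is a sketch of the idea only: the step functions and `run` are invented for illustration, Python's `utf-32-be` codec stands in for UCS-4 (an assumption), and none of it reflects recode's actual internals.

```python
def utf8_to_ucs4(buf: bytes) -> bytes:
    # Decoding is where invalid UTF-8 gets noticed.
    return buf.decode("utf-8").encode("utf-32-be")


def ucs4_to_utf8(buf: bytes) -> bytes:
    return buf.decode("utf-32-be").encode("utf-8")


def run(steps, buf: bytes) -> bytes:
    """Apply each conversion step in order; with no steps this is a mere copy."""
    for step in steps:
        buf = step(buf)
    return buf


data = bytes([128, 65, 206, 177, 66, 10])

# UTF-8..UTF-8 optimized down to zero steps: invalid bytes pass straight through.
print(run([], data) == data)

# UTF-8..UCS-4..UTF-8: the first step has to decode, so the bad byte is caught.
try:
    run([utf8_to_ucs4, ucs4_to_utf8], data)
except UnicodeDecodeError as exc:
    print(f"rejected at byte offset {exc.start}")
```

Validation here is a side effect of doing a real transformation, which is why removing the transformation also removes the check.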

Also, I notice that with --verbose recode 3.7, unlike recode 3.6, prints the request out twice. I shall fix that.

@epa

epa commented Feb 17, 2022

Yes, I suspected it might be something like that. But surely the main reason why a user would run recode UTF-8..UTF-8 rather than just cat is to get the extra validation that recode performs. So I think whatever optimization is done should not disable the error checks that prevent bad output.

@rrthomas
Owner Author

It's tricky. First, recode might be invoked with UTF-8..UTF-8 as the result of some other computation, not because the user specifically requested it. Secondly, while recode is optimizing a conversion, it might generate this sort of step internally, and there's no reason to keep it.

Internally, recode has no validation of input or output that is separate from a transformation, unfortunately.
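
For comparison, a validation pass that is separate from any transformation might look like the following Python sketch: it copies bytes through unchanged but runs them past an incremental UTF-8 decoder first. The name `validating_copy` and its parameters are hypothetical, and this is not a proposal for recode's actual design.

```python
import codecs
import io


def validating_copy(src, dst, chunk_size: int = 65536) -> None:
    """Copy src to dst unchanged, raising UnicodeDecodeError on invalid UTF-8.

    Chunks written before the error is detected stay in dst.
    """
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        decoder.decode(chunk)  # validates; keeps an incomplete tail buffered
        dst.write(chunk)
    decoder.decode(b"", final=True)  # rejects a sequence truncated at EOF


out = io.BytesIO()
# A tiny chunk size splits the two-byte sequence for α across reads.
validating_copy(io.BytesIO(b"A\xce\xb1B\n"), out, chunk_size=2)
print(out.getvalue() == b"A\xce\xb1B\n")
```

The incremental decoder carries state across chunk boundaries, so a multi-byte sequence split between two reads is still accepted.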
