Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Correct error message for invalid bytes in multiline strings and comments #21459

Merged
merged 1 commit into from
Feb 5, 2025

Conversation

WillLillis
Copy link
Contributor

@WillLillis WillLillis commented Sep 20, 2024

This change provides a correct error message when an invalid byte is found inside of a multiline string line, comments, and doc comments.

Inside multiline strings Taking the example from #20900:
const foo =
    \\const S = struct {
    \\<TAB>// hello
    \\}
;

Before:

src/main.zig:4:1: error: expected ';' after statement
    \\ // hello
^

After:

src/main.zig:4:5: error: expected 'a string literal', found invalid bytes
    \\ // hello
    ^~~~~~~~~~~
src/main.zig:4:7: note: invalid byte: '\t'
    \\ // hello
      ^~~~~~~~~~~
Doc comments
/// Some <TAB>comment

Before:

src/main.zig:1:1: error: expected type expression, found 'invalid token'
/// Some  comment
^~~~~~~~~~~~~~~~~

After:

src/main.zig:1:1: error: expected 'a document comment', found invalid bytes
/// Some  comment
^~~~~~~~~~~~~~~~~
src/main.zig:1:10: note: invalid byte: '\t'
/// Some  comment
         ^~~~~~~~~~~~~~~~~
Comments
// Some <TAB>comment

Before:

src/main.zig:1:1: error: expected type expression, found 'invalid token'
// Some  comment
^~~~~~~~~~~~~~~~~

After:

src/main.zig:1:1: error: expected 'a comment', found invalid bytes
// Some  comment
^~~~~~~~~~~~~~~~~
src/main.zig:1:9: note: invalid byte: '\t'
// Some  comment
        ^~~~~~~~~~~~~~~~~

I'm not 100% confident in the set of invalid bytes I added in lowerAstErrors, as it's just based off of those listed in the multiline_string_literal_line case of Tokenizer.next(). I think this part of the PR in particular could benefit from a close second check. :)

I initially thought that the similar problem with invalid bytes inside comments (mentioned here), but there isn't a .comment variant (only .doc_comment and .container_doc_comment) of Token.Tag to report the expected token as. Another variant to Ast.Error.Tag could be added to handle this case, but I wanted to check before doing so. See followup comment.

One minor nit/ bikeshed question: The rendered error message for this case reads error: expected 'a string literal', found invalid bytes. Since this error is triggered by the first single invalid byte encountered by the tokenizer, should this message use "byte" instead?

@WillLillis
Copy link
Contributor Author

Looking at this again while rebasing, I realized handling invalid bytes inside comments was pretty straightforward. I put the associated changes in a separate commit so it can be reverted in case I took the wrong approach.

@WillLillis WillLillis changed the title fix: Correct error message for invalid bytes in multiline strings fix: Correct error message for invalid bytes in multiline strings and comments Jan 27, 2025
@WillLillis WillLillis force-pushed the invalid_byte branch 3 times, most recently from 2a699a1 to 5fa6cf3 Compare January 27, 2025 11:54
@Vexu
Copy link
Member

Vexu commented Jan 27, 2025

This doesn't need to be split into an error and a note like the previous implementation was, "comment/string contains invalid byte '<byte>'" would be more readable IMO.

@WillLillis
Copy link
Contributor Author

This doesn't need to be split into an error and a note like the previous implementation was, "comment/string contains invalid byte '<byte>'" would be more readable IMO.

I thought the error/note implementation was helpful because the note shows exactly where the (potentially invisible) invalid byte is in the token. Do you think the gain in readability is worth the tradeoff in losing this information? If so I'm happy to try to rework this soon (hopefully later today/tomorrow).

@Vexu
Copy link
Member

Vexu commented Jan 27, 2025

You could have the error point at the invalid byte. I don't think the start of the token (so //, \\ or ") is that necessary since it's always on the same line.

lib/std/zig/Ast.zig Outdated Show resolved Hide resolved
lib/std/zig/AstGen.zig Outdated Show resolved Hide resolved
Copy link
Member

@Vexu Vexu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A further improvement would be adding an unterminated string/char literal error for string and char literals where the invalid byte is EOF/NL.

lib/std/zig/AstGen.zig Outdated Show resolved Hide resolved
@WillLillis
Copy link
Contributor Author

A further improvement would be adding an unterminated string/char literal error for string and char literals where the invalid byte is EOF/NL.

That's a good idea. I implemented this locally, would it be a better to push those changes to this branch or open a followup PR once this is merged?

@Vexu Vexu merged commit cf059ee into ziglang:master Feb 5, 2025
10 checks passed
@Vexu
Copy link
Member

Vexu commented Feb 5, 2025

I think a follow up would be better.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants