UTF8 Byte Order Mark causing very odd problems #4033

Korporal · 2022-12-20T16:55:39Z

I have a grammar that works well. I just updated it so that if it sees a file that begins with the UTF-8 byte order mark, it simply ignores the mark, treating it as optional.

But when I run the ANTLR test rig it complains, with the same complaint it generated before I added the BOM support.

Here's the updated root rule of the grammar:

translation_unit
    : BYTE_ORDER_MARK? preprocessor_stmt? procedure_stmt
    ;

Here's the definition of that token (from the same .g4 file):

BYTE_ORDER_MARK: '\u00EF\u00BB\u00BF';

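And here is the file viewed in a hex editor: https://user-images.githubusercontent.com/12262952/208721304-b3850978-eaed-4f35-94e6-49a1895eee46.png

This is the output seen in the console when I run the test rig: https://user-images.githubusercontent.com/12262952/208721622-57920e13-6082-41e5-b728-e79ba787147a.png

I even tried specifying -encoding UTF8 when I run the test rig, but that has no effect.

This is confusing the hell out of me because Microsoft do the very same thing in one of their .g4 files (https://github.com/antlr/grammars-v4/blob/master/csharp/CSharpLexer.g4): https://user-images.githubusercontent.com/12262952/208722129-3e4f435f-a878-4751-935a-e7f76e3f6f91.png

and the parser (https://github.com/antlr/grammars-v4/blob/master/csharp/CSharpParser.g4): https://user-images.githubusercontent.com/12262952/208722388-09d10071-f018-4e99-b920-c1a302834837.png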

It seems to continue parsing after reporting the error, and it parses the rest of the file fine. If I consume the grammar and the same test source file using the generated C# code, that works too, though I suppose the .NET stream handling is stripping away the BOM itself. But what am I doing wrong? Why does the test rig complain?

ericvergnaud · 2022-12-20T17:32:20Z

Maintainer

The BOM is a 3-byte mark, not 6 bytes. The issue needs to be addressed in the InputStream, not in the grammar.
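
Handling the BOM in the input stream rather than in the grammar could look roughly like the following in Java, the language the test rig itself is written in. This is only a sketch under stated assumptions: `BomSkipper` and `skipUtf8Bom` are hypothetical names that are not part of ANTLR, and only the UTF-8 BOM (bytes EF BB BF) is handled.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical helper, not part of ANTLR: consumes a leading UTF-8 BOM if present.
final class BomSkipper {
    private BomSkipper() {}

    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        // A pushback buffer of 3 bytes lets us "un-read" the prefix when it is not a BOM.
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];

        // Read up to 3 bytes; a single read() call may return fewer than requested.
        int n = 0;
        while (n < head.length) {
            int r = pushback.read(head, n, head.length - n);
            if (r < 0) {
                break; // end of stream before 3 bytes were available
            }
            n += r;
        }

        boolean isBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;

        // Not a BOM: put back whatever was read so the caller sees the original bytes.
        if (!isBom && n > 0) {
            pushback.unread(head, 0, n);
        }
        return pushback;
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a', 'b'};
        InputStream stripped = skipUtf8Bom(new ByteArrayInputStream(sample));
        System.out.println((char) stripped.read()); // prints "a"
    }
}
```

The returned stream can then be decoded as UTF-8 and passed to whatever character stream the lexer reads from (for example via ANTLR's `CharStreams.fromStream`), so the grammar never sees the mark.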

1 reply

Korporal Dec 20, 2022
Author

The existing .g4 grammar (in the antlr grammars-v4 repo) defines it in exactly the same way, though. I copied it from an existing ANTLR grammar! (I mistakenly referred to Microsoft, but it is an ANTLR grammar file; see line 11 there.)

Korporal · 2022-12-20T18:38:24Z

Author

Actually I think I understand you now. That is a 6-byte specifier, likely because they expect the consumer to read these as 16-bit Unicode chars (which their Roslyn tools likely do; the .NET char is 16 bits wide).

But I should be able to specify these bytes and skip them, and I can't. Is there no way to represent this in ANTLR lexer syntax?

BYTE_ORDER_MARK: '\0xEF\0xBB\0xBF';

ANTLR says these are invalid escape sequences, so how does one represent this? I've searched extensively for information on whether ANTLR lets us specify characters in hex form and found nothing. It's also very hard to search for, because every search just finds examples of how to define literals in grammars for other languages!
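
The "3 bytes versus 6 bytes" point can be checked with a small Java sketch (the class name `BomDecodingDemo` is hypothetical). It shows that the three BOM bytes decode to the single code point U+FEFF when read as UTF-8, while reading them one byte per character yields the three characters U+00EF U+00BB U+00BF, which take six bytes when written back out as UTF-8. ANTLR 4 lexer literals do accept `\uXXXX` escapes, so, assuming the input really is decoded as UTF-8 before it reaches the lexer, a rule such as `BYTE_ORDER_MARK: '\uFEFF';` may be closer to what is intended; that is a suggestion, not something confirmed in this thread.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: how the BOM bytes look under two different decodings.
final class BomDecodingDemo {
    private BomDecodingDemo() {}

    /** The three bytes of the UTF-8 byte order mark as they appear on disk. */
    static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** Decoded as UTF-8, the three bytes become one character, U+FEFF. */
    static String decodedAsUtf8() {
        return new String(UTF8_BOM, StandardCharsets.UTF_8);
    }

    /** Decoded one byte per character (ISO-8859-1): U+00EF, U+00BB, U+00BF. */
    static String decodedAsLatin1() {
        return new String(UTF8_BOM, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String utf8 = decodedAsUtf8();
        String latin1 = decodedAsLatin1();

        System.out.println(utf8.length());                       // prints 1
        System.out.println(Integer.toHexString(utf8.charAt(0))); // prints feff
        System.out.println(latin1.length());                     // prints 3

        // The three-character form is what the literal in the grammar describes;
        // encoded as UTF-8 it occupies six bytes, not three.
        System.out.println(latin1.getBytes(StandardCharsets.UTF_8).length); // prints 6
    }
}
```

Which of the two forms the lexer actually sees depends on the charset used to decode the file.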

0 replies
