-
The BOM is a 3-byte mark, not 6 bytes.
The issue needs to be addressed in the InputStream, not in the grammar.
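To illustrate what handling the mark at the stream level can look like, here is a minimal sketch in Python rather than in ANTLR's own runtime. The helper name `strip_utf8_bom` is hypothetical and not part of any ANTLR API; the idea is only that the three BOM bytes are dropped before the text ever reaches a lexer.

```python
import codecs


def strip_utf8_bom(raw: bytes) -> str:
    """Decode UTF-8 bytes, dropping a leading byte order mark if present.

    Hypothetical helper for illustration; not an ANTLR function.
    """
    # codecs.BOM_UTF8 is the 3-byte sequence EF BB BF.
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    return raw.decode("utf-8")


with_bom = codecs.BOM_UTF8 + b"procedure main;"
without_bom = b"procedure main;"

# Both inputs yield the same text for the lexer to consume.
print(strip_utf8_bom(with_bom) == strip_utf8_bom(without_bom))
```

In Python the same effect is available by decoding with the `utf-8-sig` codec; the explicit version is shown only to make the 3-byte check visible.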
… On 20 Dec 2022, at 17:55, Hugh Gleaves ***@***.***> wrote:
I have a grammar that works well. I just updated it so that if it sees a file that begins with the UTF-8 byte order mark, it simply ignores it, treating it as optional.
But when I run the ANTLR test rig it complains, with the same complaint it generated before I added the BOM support.
Here's the updated root rule of the grammar:
translation_unit
: BYTE_ORDER_MARK? preprocessor_stmt? procedure_stmt
;
Here's the definition of that token (from the same .g4 file):
BYTE_ORDER_MARK: '\u00EF\u00BB\u00BF';
And here is the file viewed in a hex editor:
<https://user-images.githubusercontent.com/12262952/208721304-b3850978-eaed-4f35-94e6-49a1895eee46.png>
This is the output seen in the console when I run the test rig:
<https://user-images.githubusercontent.com/12262952/208721622-57920e13-6082-41e5-b728-e79ba787147a.png>
I even tried specifying -encoding UTF8 when running the test rig, but that has no effect.
This is confusing the hell out of me because Microsoft do the very same thing <https://github.com/antlr/grammars-v4/blob/master/csharp/CSharpLexer.g4> in one of their .g4 files:
<https://user-images.githubusercontent.com/12262952/208722129-3e4f435f-a878-4751-935a-e7f76e3f6f91.png>
and the parser <https://github.com/antlr/grammars-v4/blob/master/csharp/CSharpParser.g4>:
<https://user-images.githubusercontent.com/12262952/208722388-09d10071-f018-4e99-b920-c1a302834837.png>
What am I doing wrong?
-
Actually, I think I understand you now. That is a 6-byte specifier, likely because they expect the consumer to read these as 16-bit Unicode chars (which their Roslyn tools likely do, the .Net …). I should be able to specify these bytes and skip them, but I can't. Is there no way to represent this in ANTLR lexer syntax:
ANTLR says these are invalid escape sequences, so how does one represent this? I've searched extensively for information on whether ANTLR lets us specify characters in hex form and found nothing. It's also very hard to search for, because every search just finds examples of how to define literals in grammars for other languages!
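One way to see where the byte counts diverge is to decode the same three bytes under two encodings. The sketch below uses Python purely for illustration. It assumes the test rig decodes the file as UTF-8 before the lexer sees it, which this thread does not confirm.

```python
# The three bytes shown at the start of the file in the hex editor.
BOM_BYTES = b"\xef\xbb\xbf"

# The same escapes the BYTE_ORDER_MARK rule spells out: three separate characters.
GRAMMAR_LITERAL = "\u00EF\u00BB\u00BF"

# Decoded as UTF-8, the three bytes collapse into one code point.
as_utf8 = BOM_BYTES.decode("utf-8")

# Decoded as Latin-1, each byte becomes its own code point.
as_latin1 = BOM_BYTES.decode("latin-1")

print([f"U+{ord(c):04X}" for c in as_utf8])    # ['U+FEFF']
print([f"U+{ord(c):04X}" for c in as_latin1])  # ['U+00EF', 'U+00BB', 'U+00BF']
print(as_utf8 == GRAMMAR_LITERAL)              # False
print(as_latin1 == GRAMMAR_LITERAL)            # True
```

Under that assumption, the three-character literal can only match input read one byte per character, while UTF-8 decoded input would present a single U+FEFF at the start. A rule written with the single escape '\uFEFF' would be the candidate to try, though that is untested here.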
-
The test rig does seem to continue parsing after reporting the error, and it parses the rest of the file fine. If I consume the grammar and the same test source file using the generated C# code, that works too; I suppose the .NET stream classes are detecting and stripping the BOM themselves. But what am I doing wrong? Why does the test rig complain?