Built-in macros for common lexer patterns #9

ericvergnaud · 2023-12-15T20:17:22Z

ericvergnaud
Dec 15, 2023
Maintainer

A number of lexer patterns are valid in almost every programming language, including DSLs.

For example, most languages support the following in similar forms:

integers in various formats (decimal, hex, octal, binary...)
decimals in various formats (decimal, power of 10...)
double quoted strings, single quoted strings with escape sequences
identifiers (starting with an alpha, followed by alphanums | '_'...)
whitespace

Writing lexer rules for these is an amusing learning exercise, but providing macros for them would accelerate development.

A lexer rule using a macro could look as follows:

INTEGER_LITERAL: #all_integer_literals;

This would be much simpler and more readable than:

INTEGER_LITERAL
    : Integer | Hexadecimal
    ;

fragment
Integer
    : '0' | [1-9] [0-9]*
    ;

fragment
Hexadecimal
    : ( '0x' | '0X' ) HexNibble+
    ;

fragment
HexNibble
    : [0-9a-fA-F]
    ;

An IDE would need the ability to provide 'macro insights', i.e. to show whichever concrete rules constitute the macro.
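To make the expansion idea concrete, here is a minimal sketch of what a macro expander could do. It is not part of ANTLR: the `MACROS` table and the `expand_macros` function are hypothetical names, and the `#name` syntax is only the one proposed above. The substitution is naive text replacement, so it assumes `#` is not used for anything else in the grammar.

```python
import re

# Hypothetical macro table (not an ANTLR feature):
# macro name -> (rule body, fragment rules the body depends on).
MACROS = {
    "all_integer_literals": (
        "Integer | Hexadecimal",
        [
            "fragment Integer : '0' | [1-9] [0-9]* ;",
            "fragment Hexadecimal : ( '0x' | '0X' ) HexNibble+ ;",
            "fragment HexNibble : [0-9a-fA-F] ;",
        ],
    ),
}


def expand_macros(grammar):
    """Replace each #macro reference with its body and append the fragments it needs."""
    needed = []

    def substitute(match):
        body, fragments = MACROS[match.group(1)]
        for fragment in fragments:
            if fragment not in needed:  # emit shared fragments only once
                needed.append(fragment)
        return body

    expanded = re.sub(r"#(\w+)", substitute, grammar)
    return "\n".join([expanded.rstrip(), *needed])


print(expand_macros("INTEGER_LITERAL: #all_integer_literals;"))
```

The printed result is the kind of text an IDE could show as a 'macro insight': the rewritten rule followed by the concrete fragment rules behind it.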

ftomassetti · 2023-12-22T08:47:46Z

ftomassetti
Dec 22, 2023
Collaborator

I think this would be a very good idea. Besides accelerating development for "pros", it would also lower the entry barrier for "newbies"


kaby76 · 2023-12-22T13:07:20Z

kaby76
Dec 22, 2023
Collaborator

If you want built-in rules, then add them. But I don't see the need in grammars-v4, because all these grammars already describe the lexical structure of the language, including ints, ids, string literals, etc.

We can rewrite the rules in these grammars via Trash to use the built-ins, but the rewriter would need to check the RHS of each rule and verify that it is the same as the built-in intended to replace it. This is likely pretty easy to do (I already have a "fold" rewrite that does essentially the same thing). But then what?

This shortens the grammars by a few rules. Is that valuable? I don't know. However, if the built-in rules offered some improvement in speed, then yes, this would be a good addition.

If I want to take this grammar that now uses the built-in lexer rules and port it to another parser generator (Trash can do this), then I need to get these definitions from somewhere, because I will need to implement them in the other parser generator.

The problem I have is how Antlr grammar composition works, especially for lexer rules. They do not follow the usual semantics of any implementation of EBNF. Most people write grammars independently of other grammars. If you try to graft together grammars, the default should be that they work independently of each other. This is the hallmark of a good programming language: referential transparency. This was the reason FORTRAN was such a disaster in the "old days". Every variable was global. You didn't know what subroutine was modifying what. Unfortunately, Antlr is really at this stage.

Currently, the default-mode lexer rules of all imported grammars are pooled together in a single default mode. You now have a lexer that works unpredictably because the recognition depends on import order.
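A toy model of the pooling problem, in Python rather than Antlr. This is only an illustration, not Antlr's actual import algorithm: the rule names, the two rule lists, and the `pool` and `token_type` helpers are all made up for the example.

```python
import re


def pool(*grammars):
    """Concatenate rule lists in import order, as one shared default mode."""
    return [rule for grammar in grammars for rule in grammar]


def token_type(rules, text):
    """Longest match wins; on a tie, the earliest rule in the pool wins."""
    best_name, best_len = None, 0
    for name, pattern in rules:
        m = re.match(pattern, text)
        if m and m.end() > best_len:  # strict '>' keeps the earlier rule on ties
            best_name, best_len = name, m.end()
    return best_name


# Two independently written rule sets (hypothetical).
HTML_RULES = [("TAG_NAME", r"[a-z]+")]
CSS_RULES = [("IMPORTANT", r"important")]

print(token_type(pool(HTML_RULES, CSS_RULES), "important"))  # TAG_NAME
print(token_type(pool(CSS_RULES, HTML_RULES), "important"))  # IMPORTANT
```

The same input gets a different token type depending only on which rule set was pooled first, even though neither rule set changed.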

I'm exploring how to add CSS and JavaScript to the HTML grammar. The first step is rewriting the lexer grammars so that they do not use the default mode.


ericvergnaud · 2023-12-22T13:25:05Z

ericvergnaud
Dec 22, 2023
Maintainer Author

"You now have a lexer that works unpredictably because the recognition depends on import order."

You are right, and that is what we're looking to tackle with include.

But this proposal does not affect rule order. The grammar author would still be responsible for putting the rules in the right sequence.
We'd need the tool to provide macro expansion so that authors can examine the actual rules.

I agree with Federico's comments. Just yesterday I was helping out a newbie on this precise topic.

1 reply

kaby76 Dec 22, 2023
Collaborator

"I agree with Federico's comments. Just yesterday I was helping out a newbie on this precise topic."

What was the problem? Please explain.

If you add built-ins, please tell me how I can get the definition so I can then use the definition in grammar rewrites. Trash can manipulate grammars really well, but it can't do that if it doesn't have a definition for the symbol.

This is the problem I have with Eclipse XText. Built-ins are not in a text file. In fact, they're wrapped up in the whole damn Eclipse IDE. I can't look into that very easily.

In order to keep Antlr5 "open", I suggest a command-line tool that spits out the definitions of the symbols.

ericvergnaud · 2023-12-22T15:05:09Z

ericvergnaud
Dec 22, 2023
Maintainer Author

Here is the issue: antlr/antlr4#4498 (comment), "Numbers and strings cannot be recognized correctly". It’s the typical starter issue re-inventing the wheel…


kaby76 · 2023-12-22T16:05:14Z

kaby76
Dec 22, 2023
Collaborator

OK, I'm a little confused.

First off, the fellow posts antlr/antlr4#4498 but never actually asks a question. (Beginners seem to do that a lot.) But let's assume he is asking: "Why doesn't this input parse for my grammar?" Clearly, he doesn't understand the "two golden rules how Antlr lexers work". Most beginners don't understand Antlr lexers because they keep thinking this is EBNF. No, Antlr lexer grammars don't work that way. They work independently from the parser. They pick the rule that matches the longest stretch of input; if two or more rules match the same length, the one defined first "wins." This is a recurring problem we see on Stack Overflow, at least one or two times a month. Bart Kiers has a lot of patience explaining that over and over.
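For anyone who wants to see the two rules in action, here is a small Python sketch of a lexer that follows them. It is a toy, not Antlr's implementation: the `RULES` table and the `tokenize` function are invented for the example, and it uses Python regular expressions instead of Antlr lexer rules.

```python
import re

# Rule order matters: IF is listed before ID on purpose.
RULES = [
    ("IF", r"if"),
    ("ID", r"[a-z]+"),
    ("INT", r"[0-9]+"),
    ("WS", r"[ \t\r\n]+"),
]


def tokenize(text):
    """Tokenize without any parser context: longest match first, then first rule."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in RULES:
            m = re.compile(pattern).match(text, pos)
            # Rule 1: a longer match beats a shorter one.
            # Rule 2: strict '>' means an equal-length match never replaces an earlier rule.
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name != "WS":  # drop whitespace, like a skipped token
            tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens


print(tokenize("if ifx 42"))
# [('IF', 'if'), ('ID', 'ifx'), ('INT', '42')]
```

"if" matches both IF and ID with the same length, so the earlier rule (IF) wins; "ifx" is matched by ID because ID consumes more input than IF does. Swapping IF and ID in the table would turn every "if" into an ID, which is the usual beginner surprise.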

Actually, what I thought you meant was a requirement for a reusable set of rules, something like a package library (as with npm), or something really "built into" the runtime, with INT, STRING, etc. preloaded and available for use in a grammar without having to define them oneself. Or is this something analogous to generics or templates? Sorry if I'm lost.

Rereading your original comment, when I hear "macros", I think C preprocessor. Is that what you are thinking of? A macro feature, or a package library, or both? Or templates? All of these analogies could have value. How would all_integer_literals be defined?

Note that Trash is a toolkit that sits on top of Antlr. Its trunfold tool takes a grammar as input and outputs a grammar in which one or more rules are unfolded. I think we need to draw a distinction between a parser generator and a tool that refactors grammars.


ericvergnaud · 2024-01-02T09:50:25Z

ericvergnaud
Jan 2, 2024
Maintainer Author

I'm beginning to think that this proposal is better addressed by includes.
antlr5 could ship with a set of simple, includable lexer grammars that are easy to reuse.

2 replies

ftomassetti Jan 2, 2024
Collaborator

Makes sense

KvanTTT Jan 3, 2024
Collaborator

Yes, and different languages can have slightly different definitions of literals.

ericvergnaud · 2024-01-03T14:15:55Z

ericvergnaud
Jan 3, 2024
Maintainer Author

Closing this topic in favour of includes

