Compatibility/Integration with other systems #26

ftomassetti · 2024-01-26T13:43:03Z

ftomassetti
Jan 26, 2024
Collaborator

This is a bit of a broad topic but I would be curious to hear your opinions about this.

In general we all use parsers as components of some applications: a smart editor, a transpiler, a compiler, an interpreter, a code analysis system, etc.

So parsers need to be integrated with these other systems, which need to consume the parse-trees returned by ANTLR (and possibly a list of issues), transform them, store them, generate output from them, etc.
Sometimes these systems also need to learn about the structure of the language: here I do not mean the syntax but the list of the different kinds of nodes and the structure of each kind. This is what some people would call the meta-model (or M2). When I have this need I either parse the grammar or use reflection to derive the structure of the language from the structure of the Context classes.
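
To make the reflection idea concrete, here is a minimal sketch in Python. It does not use ANTLR at all: `Node`, `TermContext` and `ExprContext` are hypothetical stand-ins that only imitate the idea of generated Context classes, and the sketch assumes every accessor carries a return annotation. Real generated classes differ per target, so the inspection logic would need adapting.

```python
import inspect
import typing


class Node:
    """Hypothetical common base class for all node kinds."""


# Stand-ins for generated Context classes (invented for this example).
class TermContext(Node):
    def NUMBER(self) -> str: ...


class ExprContext(Node):
    def term(self) -> list[TermContext]: ...
    def PLUS(self) -> list[str]: ...


def describe(node_class):
    """Map each accessor name to (child kind name, may occur many times)."""
    members = {}
    for name, fn in inspect.getmembers(node_class, inspect.isfunction):
        hint = fn.__annotations__.get("return")
        if name.startswith("_") or hint is None:
            continue
        many = typing.get_origin(hint) is list
        kind = typing.get_args(hint)[0] if many else hint
        members[name] = (kind.__name__, many)
    return members


def metamodel(base):
    """Describe every direct subclass of `base`: the list of node kinds."""
    return {cls.__name__: describe(cls) for cls in base.__subclasses__()}


print(metamodel(Node))
```

The resulting dictionary is a crude meta-model: one entry per node kind, listing its children and whether each child can repeat.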

This is a problem that I am seeing over and over again with different systems in the language engineering field and one possible answer is this project called LionWeb. In essence the idea is to define a series of formats and protocols for interoperability.

In the case of ANTLR, if ANTLR had a compatibility layer with LionWeb we could benefit from all the infrastructure that there is currently in LionWeb and the one that eventually will be produced:

- Take a parse-tree and store it in a model repository compatible with LionWeb, where we could run queries and process the parse-trees obtained by parsing entire codebases. This would be very useful for code analysis scenarios that need to process large codebases.
- Export to EMF
- Export to JetBrains MPS
- Load an exported parse-tree using the existing bindings for Java and TS, with the one for C# about to be released
- Serialize the tree to JSON
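
As an illustration of the last item only, this is a minimal sketch of a JSON round trip for a tree. The `ParseNode` class and the `kind`/`text`/`children` keys are invented for the example; this is not the LionWeb serialization format and not an ANTLR API.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ParseNode:
    kind: str                       # rule name or token type
    text: Optional[str] = None      # only set on leaves (tokens)
    children: List["ParseNode"] = field(default_factory=list)


def to_dict(node):
    """Convert a tree into plain dicts and lists that json can encode."""
    data = {"kind": node.kind}
    if node.text is not None:
        data["text"] = node.text
    if node.children:
        data["children"] = [to_dict(child) for child in node.children]
    return data


def from_dict(data):
    """Rebuild the tree from the structure produced by to_dict."""
    children = [from_dict(child) for child in data.get("children", [])]
    return ParseNode(data["kind"], data.get("text"), children)


tree = ParseNode("expr", children=[
    ParseNode("NUMBER", "1"),
    ParseNode("PLUS", "+"),
    ParseNode("NUMBER", "2"),
])
encoded = json.dumps(to_dict(tree))
restored = from_dict(json.loads(encoded))
```

A shared format would need more than this (node identifiers, references, a pointer to the language definition), which is the part an interoperability layer would standardize.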

In the future one could also share the effort to build common infrastructure, for example code generators that take LionWeb models as input, or tree-rewriting/model-transformation systems that work on any LionWeb model (the alternative is creating a tree-rewriting system that is specific to ANTLR5).

Besides LionWeb, I am curious to hear if you encountered cases where you wanted to somehow import a parse-tree into some other system, or serialize it, or process it using existing tools and libraries which required you to write some sort of adapters.

I hope this does not sound too confused or broad.

kaby76 · 2024-01-26T15:55:23Z

kaby76
Jan 26, 2024
Collaborator

This is a very important topic. Antlr is a "parser generator", but it needs to work with tools beyond the scope of a "parser generator".

The first thing that has to be clarified is the grammar for Antlr5 itself. This is because nothing can proceed on reading and analyzing grammars without the meta-meta model.

The other thing to keep in mind is to think of the parse tree as a "DOM" (hopefully with round-trip import/export), because this is the standard many tools use. Intertoken text should be placed in the PT as attributes. I would probably add line/column information as well, although this duplicates information that can be derived from the frontier of the PT.
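
A minimal sketch of the "parse tree as DOM" idea, using Python's standard `xml.etree.ElementTree`. The element names and the `before`, `line` and `col` attribute names are assumptions made for this example, and for simplicity the intertoken text is stored on the token element itself; where it should live is a separate design question.

```python
import xml.etree.ElementTree as ET


def add_token(parent, kind, text, before, line, col):
    """Append a token element; `before` holds the intertoken text preceding it."""
    attrs = {"before": before, "line": str(line), "col": str(col)}
    element = ET.SubElement(parent, kind, attrs)
    element.text = text
    return element


def source_text(root):
    """Rebuild the input by walking the frontier (the leaves) in document order."""
    leaves = (el for el in root.iter() if len(el) == 0)
    return "".join(el.get("before", "") + (el.text or "") for el in leaves)


# Parse tree for the input: 1 + /* two */ 2
expr = ET.Element("expr")
add_token(expr, "NUMBER", "1", "", 1, 0)
add_token(expr, "PLUS", "+", " ", 1, 2)
add_token(expr, "NUMBER", "2", " /* two */ ", 1, 14)

exported = ET.tostring(expr, encoding="unicode")
imported = ET.fromstring(exported)
numbers = [el.text for el in imported.findall(".//NUMBER")]
```

Because the comment and whitespace survive as attributes, `source_text(imported)` gives back the original input, and the tree can be queried with the XPath subset that ElementTree supports.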

What query language do you intend for LionWeb to support? It doesn't say much.

I have been writing a requirements spec for the next version of Trash, which is a command-line grammar toolkit. Among other things, my plan is to include the Z3 SMT theorem prover for analysis of grammars. The other part I intend to do is to use an up-to-date XQuery engine. Trash passes around parse trees in a JSON format, but it includes other information about the char buffer, and minimal information about the grammar used in the parse. Passing around just a parse tree (actually a collection of many) is not enough.

6 replies

kaby76 Jan 29, 2024
Collaborator

The plan for the Z3 tool will be to check for parser rules that cause performance issues, probably ambiguities. I'm still trying to define the CNFs involved. One such anti-pattern is a rule r : ... a* b ...; where a can derive b. There are also a number of scripts that find a number of problems with grammars, over in https://github.com/kaby76/g4-scripts. These are mostly pure XPath expressions. Unfortunately, XPath is insufficient for things like the above anti-pattern.
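
To illustrate the anti-pattern, here is a deliberately simplified sketch in plain Python. It is not a Z3 encoding and not how Trash works: the grammar is a toy dictionary invented for the example, and "a can derive b" is approximated by following only single-symbol alternatives, so it misses derivations that go through nullable symbols or longer alternatives.

```python
from collections import deque

# Toy grammar: rule name -> alternatives, each a list of symbols.
# A trailing "*" marks a Kleene-starred symbol. Encoding invented for this sketch.
GRAMMAR = {
    "r": [["x", "a*", "b", "y"]],
    "a": [["b"], ["c"]],
    "b": [["B"]],
    "c": [["C"]],
    "x": [["X"]],
    "y": [["Y"]],
}


def unit_derives(grammar, start, target):
    """True if `start` reaches `target` in zero or more single-symbol steps."""
    seen, queue = {start}, deque([start])
    while queue:
        current = queue.popleft()
        if current == target:
            return True
        for alt in grammar.get(current, []):
            if len(alt) == 1 and alt[0] not in seen:
                seen.add(alt[0])
                queue.append(alt[0])
    return False


def find_star_then_derivable(grammar):
    """Report (rule, a, b) wherever `a* b` occurs and a can unit-derive b."""
    hits = []
    for rule, alternatives in grammar.items():
        for alt in alternatives:
            for left, right in zip(alt, alt[1:]):
                if left.endswith("*") and unit_derives(grammar, left[:-1], right):
                    hits.append((rule, left[:-1], right))
    return hits


print(find_star_then_derivable(GRAMMAR))
```

The reachability search is the part that a single XPath expression over the grammar's parse tree cannot express directly, since it needs a transitive closure over rule references.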

KvanTTT Jan 29, 2024
Collaborator

> The plan for the Z3 tool will be to check for parser rules that cause performance issues, probably ambiguities.

Interesting topic, but using the Z3 solver looks like overkill for the current task, in my opinion. From my current understanding, it should be possible to find "problem" rules using ATN analysis. Could you demonstrate a Z3 query for a simple case?

KvanTTT Jan 29, 2024
Collaborator

> This is a very important topic. Antlr is a "parser generator", but it needs to work with tools beyond the scope of a "parser generator".

Completely agree. Honestly, this task is more important to me than the WebAssembly-related work (though I've spent some time porting the Kotlin target because it didn't require much effort and I consider Kotlin more suitable than Java). I don't believe WA resolves all performance problems, because some of them come from the ANTLR algorithm itself.

> The first thing that has to be clarified is the grammar for Antlr5 itself. This is because nothing can proceed on reading and analyzing grammars without the meta-meta model.

We have such a grammar but in ANTLR 3 format 😄 But I'd say it's over-complicated.

> This is because this is the standard many tools use. Intertoken text should be placed in the PT as attributes. I would probably add line/column information as well, although this duplicates information that can be derived from the frontier of the PT.

We can use the standard approach here with leading/trailing tokens; see how they are stored in Roslyn, for example: how-are-comments-stored-in-the-syntax-tree-and-how-to-use-the-syntax-visualizer.

kaby76 Jan 29, 2024
Collaborator

I don't have a proof of concept yet, so that will be the first thing to do. There has been research published on the subject starting about 15 years ago [1] [2] [3] [4]. As one of those references points out, the main problem is how practical the solution is.

I am getting close to getting my VSCode Antlr extension generator working again. I'm planning on adding that to Trash for Antlr5, which adds categories. The main problem is that everything is a "moving target" (TypeScript, VSCode extension code, C#). I'm playing catch-up.

kaby76 Jan 29, 2024
Collaborator

> We can use the standard approach here with leading/trailing tokens; see how they are stored in Roslyn, for example: how-are-comments-stored-in-the-syntax-tree-and-how-to-use-the-syntax-visualizer.

I attached comments and whitespace to the parent node of a token, not the leaf node (aka "token") itself. The problem is that intertoken text can be associated with either the previous or the following parse tree node. For example, if we want to unfold a, defined by a : b c;, we don't want to grab the leading space before the b when placing the right-hand side in d : a x y z;. The result should be d : (b c) x y z;, not d : ( b c) x y z;. XPath doesn't have a good way to reference attributes before or after a DOM node, so I haven't quite figured out the best representation.
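
The unfold example can be sketched as follows. This is only an illustration of the spacing problem, not Trash's representation: the `Tok` class is invented here, and it stores intertoken text as leading text on each token, which is one of the possible attachment choices discussed above.

```python
from dataclasses import dataclass


@dataclass
class Tok:
    text: str
    leading: str = ""   # intertoken text (whitespace, comments) before the token


def render(tokens):
    """Concatenate tokens together with the text that precedes each of them."""
    return "".join(tok.leading + tok.text for tok in tokens)


def unfold(rhs, name, body):
    """Replace each reference to `name` in `rhs` with the parenthesized body."""
    out = []
    for tok in rhs:
        if tok.text != name:
            out.append(tok)
            continue
        # The opening parenthesis takes over the spacing of the replaced reference.
        out.append(Tok("(", tok.leading))
        # Drop the leading text of the first body token; keep the rest as written.
        out.extend(Tok(t.text, "" if i == 0 else t.leading)
                   for i, t in enumerate(body))
        out.append(Tok(")"))
    return out


a_body = [Tok("b", " "), Tok("c", " ")]                          # a : b c;
d_rhs = [Tok("a", " "), Tok("x", " "), Tok("y", " "), Tok("z", " ")]  # d : a x y z;
result = "d :" + render(unfold(d_rhs, "a", a_body)) + ";"
print(result)
```

With leading-only attachment the fix is a special case for the first token of the body. If the same space were stored as trailing text on the colon, or on the parent node, the rewrite would need a different rule, which is why the choice of representation matters.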

ericvergnaud · 2024-01-29T18:50:57Z

ericvergnaud
Jan 29, 2024
Maintainer

@KvanTTT you are correct that WA will not resolve all performance problems with ANTLR. The idea is that it will significantly improve performance with JS/TS and Python. But other optimizations are more than welcome, and thanks to the unified Wasm runtime, they will become available for all targets - at least that's the proposed strategy.

0 replies

kaby76 · 2024-02-03T12:52:56Z

kaby76
Feb 3, 2024
Collaborator

One such anti-pattern is a rule r : ... a* b ...; where a can derive b.

The above pattern has a right-recursion/de-Kleene operator pattern as well in scotty.g4. This pattern causes a large max-k lookahead to resolve when to finish the rule, causing full context fallbacks. Very, very expensive for Antlr. Simple cases are easy to detect with XPath expressions, but more complicated ones that involve string rewrites are harder.

0 replies
