Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce remove ambiguous grammar RFC #1

Draft
wants to merge 7 commits into
base: main
Choose a base branch
from
Draft
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
244 changes: 244 additions & 0 deletions text/0001-remove_ambiguous_grammar.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,244 @@
# SurrealQL: Remove ambiguity in grammar

## Glossary

- Production: A branch or leaf of the parse tree.
- Identifier: Names used in code which don't belong to the structure of a statement.
- Raw Identifier: A identifier without being surrounded by ` or `⟨ ⟩`
- Keyword: Any identifier like production which belongs to the structure of a statement.
- Reserved word: A keyword which can’t be used as a identifier.
- Surrounded Identifier: An identifier which is surrounded by a delimiter, allowing otherwise disallowed identifiers to be used. Example: ``foo\\nbar``
- Strand: A string like production i.e. something like `"hello world"`

# 1. Summary

The current version of SurrealQL grammar, as defined by what the parser currently accepts, contains several ambiguous productions. These productions are parsed differently depending on the context or can seem very similar to each other, but subtle differences can result in completely different semantics.
DelSkayn marked this conversation as resolved.
Show resolved Hide resolved

These ambiguities in the grammar complicate parser design, limit possible future extensions to SurrealQL that don't break, and could be confusing when using the language.

This RFC proposes several changes to the grammar to limit the present ambiguity:

- Disallow a value from starting with an identifier which could, in its current position in the grammar, start a statement as a keyword.
- Require that raw identifiers don't start with a digit.
- Introduce strand prefixes for the specific types of strands
- Introduce a syntax error for block record-id object ambiguity
- Change the KNN operator from `<3>` to `knn<3>` or similar


# 2. Motivation

What motivates this proposal and why is it important?

## Context

The first implementation of the SurrealQL parser was implemented in Nom. Nom is a useful library for building a parser; however, its flexibility can also be a downside. Nom functions combine to form a tree of smaller sub-parsers, which together define a full parser. In Nom, whenever the grammar to be parsed contains a branch, Nom decides which branch to follow by taking whichever branch first parses without error. For example, see the following source code:

```
a - 1
```

and the following parser:

```jsx
alt((
tuple((ident, char('+'), number)),
tuple((ident, char('-'), number))
))
```

In this case Nom will first try to parse the first branch, fail when it encounters `char('+')` sub-parser and then back up to try the second example. Nom imposes no limits on when and how far the parser will back up whenever an error happens, only failing when it has completed a full sweep of the entire parse tree.

This ability to backup and try again allows one to produce quite powerful parsers by composing simple functions. Nom is able to parse basically any grammar if sometimes with exponential complexity.

SurrealQL was build with this parser and the flexibility of the parser can be noticed in the current flexibility of the language. The SurrealQL grammer uses backup to define a grammar which has ambiguity which is only resolved by the order by which the parser parses a production.

## Flexibility vs Clear semantic meaning.

The current grammar flexibility does allow for a more expressive syntax: more code will parse without errors, and you will often need fewer characters to write a query than if the language were more rigid.

However, I feel that this flexibility is limiting other goals that I would consider more desirable, such as the ability to precisely understand the semantics of a query. Currently, a full understanding of what a query exactly means requires quite deep knowledge of how the parser functions, not just if a token is allowed, but also what other possible tokens are allowed and how the parser will retry parsing when a parse branch fails to match.

To illustrate I have two examples.

Consider this first example:

```sql
if CONTAINS { 3 }
```

The above source is completely ambiguous and can both be parsed as an `IF` statement with as condition `CONTAINS` and as its `THEN` block `{ 3 }`, or as an expression where we check if the identifier `if` contains `{ 3 }`. The current parser will parse it as an if statement only because that is the first branch in the parser which matches.

This could result in confusion, consider the following code:

```sql
DELETE some_data FROM values WHERE if CONTAINS true
```

The above query checks for deletion if the identifier `if` contains true. It is not parsed as an `IF` statement since the parser fails when trying to parse a code block when it finds the `true` token, and therefore falls back to parsing a the production as a value.

Now suppose the user wanted to make a change and add a expression block:

```sql
DELETE some_data FROM values WHERE if CONTAINS { true }
```

Now the statement is suddenly parsed as an IF statement and, if the table happens to have a `CONTAINS` field, it will still run but with a completely different meaning.

A different type of ambiguity in today's grammar is token ambiguity.
Most languages try to limit the possible meaning of tokens as much as possible.
As apposed to most languages in SurrealQL raw Identifiers can start with a digit. This means that a number is a valid identifier, and a number with a fraction could also be parsed as a identifier and its field. This can make it hard to determine what is an identifier. Consider the following example:

```sql
UPDATE table SET 10dec = "a value" WHERE 10dec = "a different value";
```

In the above query the field `10dec` will be updated to the value "a value" if the number 10 is equal to "a different value". In SET you are not allowed to have any numbers so `10dec` which is a valid identifier will be parsed as an identifier. In the where clause however numbers are allowed so `10dec` will in that position be parsed as number.

More examples of these types of confusing grammar can probably be found.

## Hard to specify

As we are working towards a production ready database we should probably eventually create an official specification for the SurrealQL language. Most languages are either Context-Free or almost Context-Free and therefore allow specifying the grammar in Context-Free grammar. See for example the JavaScript spec: [https://tc39.es/ecma262/#prod-Statement](https://tc39.es/ecma262/#prod-Statement)

The current ambiguities in the language make it impossible to specify a grammar for SurrealQL in the same way.

## Extendibility

Because a lot of the current syntax is defined in part by the failure to parse other productions, productions which today are parsed one why could in the future silently change to have a completely different semantic meaning.

Consider the following query:

```sql
SELECT count FROM table GROUP count
```

Currently the values from this select will be grouped by the field key. But where we to every introduce some feature which would require the `COUNT` keyword after `GROUP` then this code will quietly change meaning.
DelSkayn marked this conversation as resolved.
Show resolved Hide resolved

# 3. Proposal

## Location specific reserved words.

The first change proposed is to disallow raw identifiers in places where the same identifier could also be parsed as a keyword. For example, the following code would no longer be allowed:
Copy link

@andar1an andar1an Nov 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could it be more simple to just disallow certain keywords, and not allow for it to be wrapped? Would users be very upset by being unable to use a few keywords? In my case, if I saw an error message saying "can't use this keyword", I would just change it to something else. But I am only 1 person so not sure.


```sql
use = 32;
```

`use` in this case could be the start of a USE statement in this position and would therefore not be allowed. In order to get the same code one would need to surround the identifier.

```sql
`use` = 32;
```

Another example would be:

```sql
SELECT field FROM use;
```

The part after `FROM` can also have statements, however the USE statement is disallowed in subqueries, therefore allowing `use` would not be a problem.

This change would completely remove existing ambiguity wherever an expression could possibly be parsed as a statement. It would simplify the parser a great deal. This changes also does not require that raw identifiers are completely distinct.

## Disallow raw identifiers to start with a digit

This is a common way to distinguish between a digit and number. By disallowing the first character from being a digit we can be sure that Identifiers aren’t numbers. This would resolve the same text from being used as an identifier or a number depending on the place in the grammar.

## Introduce strand prefixes

The only way can currently be distinguished is if one strand fails to parse as any of the other types. This can lead to problems where a use wanted a certain value to be a plain strand but it happened to match another type of strand and is thus converted into that specific type.

And example is `"5:00"` which brought up in an issue. The user wanted to store this value as a plain strand but it happened to match a thing strand.

I propose we introduce specific strand prefixes for specific strand types:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: will single quotes or double quotes matter? I notice uuid uses single quotes and the other 2 examples use double. Assuming either will be acceptable?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes either is acceptable, the only difference is the u or t in front of the string.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about ULIDs?? Taking a glance at SurrealDB's code, it seems that they are handled quite differently than UUIDs. Are they stored differently as well?

Also, I'm apprehensive about the record strand prefix.
I get it for datetime and UUIDs, but for solving the ambiguity between record IDs and object fields, I think the object syntax should change.

I think the way to go is to make objects have fields that are string literals.
(In the future there could be format strings, which would allow for dynamically setting the field name)

So, for your example:

{ a:b.c }

Is taking the field c of a record a:b.
To express an object with field a with a value of b.c, you would instead write:

{ "a":b.c }

This also mirrors the rules in JSON.


- `t"2012-05-26T12-00-00Z"` for date-time strands
- `u'7c3f4ce8-83c4-458f-b1d7-b28352dea93c'` for UUID’s
- `r"5:00"` for record id strings

## Syntax Error for Block-RecordId Object ambiguity

The following statement is ambiguous:

```sql
{ a:b.c }
```

This can be parsed either as an object with a field `a` and value `b.c`, or as a block statement taking the field `c` of the record `a:b`. Removing this ambiguity completely would require making significant changes to how record IDs are joined (changing the `:` to something else) or requiring that either blocks or objects always be enclosed by parentheses. Both of these changes are quite drastic for a relatively minor ambiguity.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a minor ambiguity? It seems like you can get completely different results from the same syntax? Or am I understanding wrong?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I called it minor, because I don't think there are a lot of situation in which this ambiguity happens, but you are correct that you can get completely different results for the same syntax.


Therefore, I propose that if the parser encounters `{` `Identifier` `:`, it raises a syntax error to notify the user of the ambiguity. The user would then need to either use a record ID strand if they intended to create a block, or make the identifier a strand if they intended to create an object.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To visualize, would it be like:

{ r"some_record_id":b.c  }

or

{ a: u"some_uuid" }

Copy link
Member Author

@DelSkayn DelSkayn Nov 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ambiguity is unrelated to the ambiguity of strands.

This is a ambiguity is caused by the fact that raw records are allowed. you can for example write CREATE some_table:some_specific_id;. The problem is the :. If a raw record is the first statement in an block statement it looks exactly like an object.

In the example { a:b.c }, a:b looks like a raw record but it could also be a field definition with a value.

Raw records will still be allowed in these proposed changes, so we need a way to define the value of { a:b.c }. Currently the parse will first try to match an object and then try to match a block statement. This results in strange behavior like {a:b.c; true } being parsed as an block but {a:b.c } is parsed as an object.

The proposed change here is to define that if the parsers sees { followed by and identifier followed by : it will always choose to parse an object and error if it encounters {a:b.c; true }. To make this statement work again you could use a record strand: { r"a:b".c; true }.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I think I see now. So the last code snippet at the end of your comment was what I was trying to understand.

So the 2 options you mentioned in the rfc would be visualized as { r"a:b".c; true } or { a:r"b.c"; true }?


# 4. Drawbacks

- Some queries now require more type and syntax than they previously did.
- These changes break existing code.
- These new changes could be more difficuly to learn initially as it introduces more syntax rules to remember.
Copy link

@andar1an andar1an Nov 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If error messaging is concise and explicit as you proposed, I don't think this would be too bad to learn, vs. hitting what seem like common ambiguity problems.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree! but it is still kind of a drawback.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not arguing there, just commenting that in my personal opinion I think the benefit would outweigh the drawback :).


# 5. Alternatives

## Limited reserved word list

Instead of location specific keywords we could instead specify a limited list of keywords which are disallowed. Keywords which can't start a statement like `EVENT` or `TABLE` would still be allowed, but `USE` would be disallowed everywhere.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️ - I personally prefer this to #3. I should have kept reading before leaving that comment! haha


This would probably be easier to communicate then location specific keywords

# 6. Potential Impacts

The proposed changes are breaking changes. They will probably break existing code, if a limited set. As we have guaranteed stability it introducing these changes as the default would require releasing a new major version. Therefore I propose that the new parser will implement this syntax and introduce the new parser as a new experimental feature which in the future will possibly become the default.

# 7. Unresolved Questions

## What do we change the KNN operator to?

The `knn<3>` operator seems a bit verbose and rather complex looking for an operator.
Is this the right syntax.

## How do we differentiate an plain object from a geometry object?
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am very curious about this. I am not too sure what is currently considered a plain object and a geometry object. Are there any examples?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see a Geometry example at the bottom of page.

Is there potential that one may want to do topological analysis on plain objects like vectors or matrices, because then maybe with Geometry as a supertype one can share topology methods with plain objects?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Geometry objects are objects which have a very specific layout. The following query returns one kind of geometry object.

RETURN { 
    type: 'LineString',
    coordinates: [1,2,3,4]
}

As you can see it looks exactly like an normal object but if you change even a small part like for example:

RETURN { 
    type: 'LinesString',
    coordinates: [1,2,3,4]
}

(Lines instead of Line) it is suddenly a plain object and no longer a geometry.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I shall add this example to the rfc

Copy link

@andar1an andar1an Nov 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think considering making the plain object the supertype may also be an option: one may be able to draw some tangents from geometry kernels and languages like apt for example: https://en.wikipedia.org/wiki/APT_(programming_language).

At the end of the day Geometry is comprised of plain objects. And this way plain objects are limited to methods that work for them, and geometry objects can be expanded.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any semantic differences between geometry objects and plain objects (like operations)?
If not, then maybe the best thing to do is to treat geometries like plain objects by default and allow for defining the field as otherwise.
This would mean that an application like Surrealist can get information about the table to know how to display those fields.
We can even copy the record<table> syntax as object<schema>. There is even the potential to allow for user-defined schemas to validate against using a URI (maybe like object<"http://example.org/bibliography.schema.json">).

So for a geometry object this might look like:
DEFINE FIELD position ON TABLE location_history TYPE object<geometry>;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, guess there are operators specialized for geometry objects. Maybe it would be better to move these operators to functions.
For example:

(-0.118092, 51.509865) INSIDE {
    type: "Polygon",
    coordinates: [[
        [-0.38314819, 51.37692386], [0.1785278, 51.37692386],
        [0.1785278, 51.61460570], [-0.38314819, 51.61460570],
        [-0.38314819, 51.37692386]
    ]]
};

Could be:

geometry::inside((-0.118092, 51.509865), {
    type: "Polygon",
    coordinates: [[
        [-0.38314819, 51.37692386], [0.1785278, 51.37692386],
        [0.1785278, 51.61460570], [-0.38314819, 51.61460570],
        [-0.38314819, 51.37692386]
    ]]
});

Then the INSIDE operator could be generalized to plain objects. For example:

{ "foo": "bar" } INSIDE { "foo: "bar", "biz": "fiz" }


Geometry objects look exactly like plain objects but are handled differently internally.
There are operations which are possible on a plain object which you cant do on a geometry object, like, for example, adding a new field.
Should we resolve this in the syntax itself by assigning different syntax to geometry objects or
should we instead allow geometry objects to behave like a strict super type of plain objects,
i.e. making sure all operations which are allowed normal objects are also allowed on geometry objects?

## Addendum: List of ambiguities

The following is a (uncomplete) list of ambiguous SurrealQL statements
that currently exist in the syntax.

- KNN operator VS relation operator.

```sql
RETURN 1<2>3
```

- Assignment or equality

```sql
a = b
```

- Object or block statement with a thing

```sql
{ a:b.c }
```

- Number or operation on identifiers

```sql
123e+123

123e-123
```

- Outputs keyword or field:

```sql
CREATE foo SET before = 3 RETURN before
```

- Geometry or Object
```sql
{ kind: "LineString", coordinates; [1,2,3,4] }
```