Allow name-char as first character of unquoted-literal #990

Merged
merged 3 commits into from
Feb 12, 2025

Conversation

eemeli
Collaborator

@eemeli eemeli commented Jan 27, 2025

Fixes #724

As discussed today, our name definition is a slightly restricted variant of the XML NCName, which has not been updated to account for developments in the last 15 years or so that have e.g. included combining marks and the ALM in the name-start range.

So as the name production is already messy, unquoted-literal is not really made much worse by allowing name-char as its first character. Doing so also allows us to drop the number-literal rule from the syntax.

The number-literal rule is still needed for number functions, hence its re-definition in the Number Operands section. Note that I've left out the exponential part from it, as the + character is no longer valid without quoting.

This change also removes the need to quote 2-digit, a valid literal value for some :datetime options. Previously, it was the only default function option value that needed quoting.

Filing initially as a draft, as I want to implement this to make sure it works as expected, and to make sure that all test suite changes have been accounted for.

@@ -45,7 +45,7 @@
"src": "{|2006-01-02T15:04:06| :datetime}"
},
{
- "src": "{|2006-01-02T15:04:06| :datetime year=numeric month=|2-digit|}"
+ "src": "{|2006-01-02T15:04:06| :datetime year=numeric month=2-digit}"
Collaborator

With this change the parsers now need a pretty big lookahead.

Before, it was enough to look at the first character:
| => quoted string
0-9 or '-' => try to get a number literal
starting char => literal
anything else => Error

Collaborator

How does that change? Anywhere that literal is valid, single-character lookahead still suffices.

  • | → quoted-literal
  • $ → variable
  • : → function
  • * → key
  • # or / → (markup)
  • name-char → unquoted-literal
  • anything else → error
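
The first-character dispatch in the list above could be sketched roughly as follows. This is a minimal illustration, not the reference parser: the names `SIGILS`, `NAME_CHAR_ASCII`, and `classify_start` are hypothetical, and only the ASCII part of name-char (letters, `_`, digits, `-`, `.`) is modelled, so non-ASCII name characters are reported as errors here.

```python
import re

# ASCII subset of name-char only (assumption: the non-ASCII ranges of the
# real grammar are omitted to keep the sketch short).
NAME_CHAR_ASCII = re.compile(r"[A-Za-z_0-9.\-]")

# One-character dispatch table, following the list in the comment above.
SIGILS = {
    "|": "quoted-literal",
    "$": "variable",
    ":": "function",
    "*": "key",
    "#": "markup",
    "/": "markup",
}


def classify_start(ch: str) -> str:
    """Pick the production to try using a single character of lookahead."""
    if ch in SIGILS:
        return SIGILS[ch]
    if NAME_CHAR_ASCII.fullmatch(ch):
        return "unquoted-literal"
    return "error"


for ch in ["|", "$", "2", "-", "+"]:
    print(repr(ch), "->", classify_start(ch))
```

The point of the sketch is that digits and `-` need no special branch: they fall into the same name-char case as any other unquoted literal, while `+` is rejected without quoting.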

Member

I agree with @gibson042.

It is trivial to check the first character. A number literal can only start with one of 11 characters: - or 0..9. All of these are contained in name-char:

name-char  = name-start / DIGIT / "-" / "."
           / %xB7 / %x300-36F / %x203F-2040

So the first character is enough to send you down the path.

Of course, once you start down the path to any of these outcomes, you could end up with something bogus.
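
As a quick check of the claim that all 11 possible first characters of a number literal are contained in name-char, the rule quoted above can be written as code-point ranges. This is a sketch under one stated assumption: name-start is reduced to ASCII ALPHA and `_`, because its full definition is not quoted in this thread. The names `NAME_START_ASCII`, `NAME_CHAR_EXTRA`, and `is_name_char` are hypothetical.

```python
# Assumption: name-start is approximated by ASCII ALPHA / "_" only.
NAME_START_ASCII = [(0x41, 0x5A), (0x61, 0x7A), (0x5F, 0x5F)]

# The remaining alternatives of name-char, as quoted in the comment above.
NAME_CHAR_EXTRA = [
    (0x30, 0x39),      # DIGIT
    (0x2D, 0x2D),      # "-"
    (0x2E, 0x2E),      # "."
    (0xB7, 0xB7),      # %xB7
    (0x300, 0x36F),    # %x300-36F
    (0x203F, 0x2040),  # %x203F-2040
]


def is_name_char(ch: str) -> bool:
    """Return True if the single character falls in one of the ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in NAME_START_ASCII + NAME_CHAR_EXTRA)


# The 11 characters that can start a number literal: "-" and 0..9.
NUMBER_START = "-0123456789"
print(all(is_name_char(c) for c in NUMBER_START))
print(is_name_char("+"))
```

The first check passes for every character in `NUMBER_START`, while `+` falls outside all of the listed ranges, which is why an exponent written with `+` needs quoting under this rule.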


```abnf
literal = quoted-literal / unquoted-literal
quoted-literal = "|" *(quoted-char / escaped-char) "|"
unquoted-literal = name / number-literal
number-literal = ["-"] (%x30 / (%x31-39 *DIGIT)) ["." 1*DIGIT] [%i"e" ["-" / "+"] 1*DIGIT]
```
Collaborator

Yet another step towards making everything a string :-(

Member

I think that comment confuses quoted literals with strings, when the two are completely separate.

The main syntax shouldn't talk about what format literal numbers are, or what format literal dates are, or what format literal units are; that's up to the functions.

Collaborator

@gibson042 gibson042 left a comment

> The number-literal rule is still needed for number functions, hence its re-definition in the Number Operands section. Note that I've left out the exponential part from it, as the + character is no longer valid without quoting.

This seems like a rather large drawback... don't we still want to allow something like {|1e+12| :currency currency=JPY}? Or for that matter, just allow + to be unquoted like e.g. unquoted-literal = 1*(name-char / "+" / …).

> This change also removes the need to quote 2-digit, a valid literal value for some :datetime options. Previously, it was the only default function option value that needed quoting.

That's nice.

@aphillips
Member

> The number-literal rule is still needed for number functions, hence its re-definition in the Number Operands section. Note that I've left out the exponential part from it, as the + character is no longer valid without quoting.

> This seems like a rather large drawback... don't we still want to allow something like {|1e+12| :currency currency=JPY}? Or for that matter, just allow + to be unquoted like e.g. unquoted-literal = 1*(name-char / "+" / …).

We've... stalled on expanding number literals to non-integer values. I think this is something that will be wanted, but attempts to address this so far have not succeeded. I do hope that they will in the future.

Note that the number-literal rule doesn't have to appear in the ABNF if the name rule or one of the literal rules already covers it. It is convenient that it is used in both, but it would be equally valid to have number.md define number literals in the section on digit size options. It might even be better to do that, so that other functions could follow that model for subsetting names (cf. @macchiati's discussion of valid vs well-formed).

So, yeah, we could consider allowing the ASCII character + in an unquoted.

Is the meta-goal to allow any literal that does not contain whitespace, bidi formatting characters, or syntax-meaningful "sigils" to be unquoted? If that's the case, maybe we restore a small production for reserved starter characters?

@eemeli
Collaborator Author

eemeli commented Feb 12, 2025

> This seems like a rather large drawback... don't we still want to allow something like {|1e+12| :currency currency=JPY}? Or for that matter, just allow + to be unquoted like e.g. unquoted-literal = 1*(name-char / "+" / …).

Is there really that much utility in supporting literal numerical values with exponents? I would claim that this is exceedingly rare, and could be left out initially. If someone does present a real-world use case, we could extend support later for + in unquoted-literal and for exponents in number-literal.

@mihnita
Collaborator

mihnita commented Feb 12, 2025

> How does that change? Anywhere that literal is valid, single-character lookahead still suffices.
> ...
> It is trivial to check the first character.

It needs more than one if I still want to keep the number-literal:
12345678.9 => number literal
12345678Z9 => string literal

But it looks like there is no appetite for that.

I think it is a mistake, but it also looks like I am in the minority.

@aphillips
Member

@eemeli mentions:

> I would claim that this is exceedingly rare, and could be left out initially. If someone does present a real-world use case, we could extend support later for + in unquoted-literal and for exponents in number-literal.

I agree that it's rare and note that we have already left it out. However, it would be good if we had enough foresight to minimize changes of this sort to the syntax in the future. Allowing + into unquoted does not mean that we "support exponents". It means that labels such as goofy+lingonberry or 1e+03 are valid unquoted strings.

@mihnita noted:

> It needs more than one if I still want to keep the number-literal:
> 12345678.9 => number literal
> 12345678Z9 => string literal

I think this is a good callout, from the point of view that the current syntax suggests we accept name or number-literal tokens, when the intention is that number-literal expands the set of starter characters. Elsewhere, in @macchiati's proposal, we do away with this (note that 12345678Z9 is not a valid name, even though we all seem to be presuming that it is a valid unquoted!!)

I think that moving to a strictly character production would allow single-lookahead to capture unquoted (good efficiency) without incorrectly suggesting that the processor capture a number or a name. Interpretation of a given option or key value as a number by a functional handler is still permitted.

@gibson042
Collaborator

> Is there really that much utility in supporting literal numerical values with exponents? I would claim that this is exceedingly rare, and could be left out initially. If someone does present a real-world use case, we could extend support later for + in unquoted-literal and for exponents in number-literal.

Exponential notation seems important to me for big numbers and small numbers, such as are encountered in finance and physics. It's also present in most if not all of the languages in which one can already NumberFormat (cf. MDN numberFormat.format(`${bigNum}E-6`)), and required in :number output for notation "scientific" and "engineering".

Given all that, I would ask what is the benefit of not supporting exponential notation in Number Operands? If it is just "it should be possible to express any number operand as an unquoted literal", then I think the same arguments apply to "+" for exponents as applied to "-" for "2-digit". But for the record, I don't consider quoting to be particularly special or onerous, and would be fine if it is required for |1.416784e+32| but not for 1.416784e32 or 1.616255e-35.

> I agree that it's rare and note that we have already left it out.

@aphillips what do you mean by "have already left it out"? Exponential notation is certainly a part of the current number-literal, which is documented to match JSON number.

> note that 12345678Z9 is not a valid name, even though we all seem to be presuming that it is a valid unquoted!!

Well, yeah. Numeric literals were never valid names, and expanding unquoted-literal to remove the first-character constraints inherently includes such garden-path cases.
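
To make the quoting consequences discussed in this comment concrete, here is a small sketch that decides whether a value can be written unquoted. It assumes that unquoted-literal is one or more name-char, and it models only the ASCII part of name-char (letters, `_`, digits, `-`, `.`), so values containing non-ASCII name characters would be wrongly reported as needing quotes. The names `UNQUOTED_LITERAL_ASCII` and `needs_quoting` are hypothetical.

```python
import re

# Assumption: unquoted-literal = 1*name-char, restricted here to the ASCII
# subset of name-char for brevity.
UNQUOTED_LITERAL_ASCII = re.compile(r"[A-Za-z_0-9.\-]+")


def needs_quoting(value: str) -> bool:
    """Return True if the value cannot be written as an unquoted literal."""
    return UNQUOTED_LITERAL_ASCII.fullmatch(value) is None


samples = ["2-digit", "1.416784e32", "1.616255e-35", "1.416784e+32", "12345678Z9"]
for value in samples:
    verdict = "needs quoting" if needs_quoting(value) else "can be unquoted"
    print(f"{value}: {verdict}")
```

Under these assumptions only `1.416784e+32` needs quoting, because of the `+`. The garden-path value `12345678Z9` is accepted as an unquoted literal even though it is neither a name nor a number.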

@eemeli
Collaborator Author

eemeli commented Feb 12, 2025

> It needs more than one if I still want to keep the number-literal:
> 12345678.9 => number literal
> 12345678Z9 => string literal

As this PR makes number-literal valid only as an operand of a numerical function, I don't see how a parser could avoid initially parsing a literal operand as an unquoted-literal or quoted-literal, given that the function comes after the operand.

Pre-emptively parsing the operand of {1.0 :string} as a number would definitely be a mistake.
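
The two-stage approach described here (keep the operand as literal text, then let the function decide how to interpret it) might look like the sketch below. The handler names `string_function` and `number_function` are hypothetical stand-ins for function implementations, and the regular expression is a transcription of the number-literal ABNF quoted earlier in this thread, exponent included.

```python
import re
from decimal import Decimal

# number-literal = ["-"] (%x30 / (%x31-39 *DIGIT)) ["." 1*DIGIT]
#                  [%i"e" ["-" / "+"] 1*DIGIT]
NUMBER_LITERAL = re.compile(r"-?(0|[1-9][0-9]*)(\.[0-9]+)?([eE][-+]?[0-9]+)?")


def string_function(operand: str) -> str:
    """Hypothetical :string handler: the literal text is kept untouched."""
    return operand


def number_function(operand: str) -> Decimal:
    """Hypothetical :number handler: only here is the text read as a number."""
    if NUMBER_LITERAL.fullmatch(operand) is None:
        raise ValueError(f"not a number-literal: {operand!r}")
    return Decimal(operand)


# The parser hands over the same literal text in both cases.
print(string_function("1.0"))
print(number_function("12345678.9"))
try:
    number_function("12345678Z9")
except ValueError as err:
    print("error:", err)
```

In this sketch the parser never needs to distinguish 12345678.9 from 12345678Z9. Both reach the function as text, and only a numeric function rejects the second one.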

@aphillips
Member

@gibson042 asked:

> what do you mean by "have already left it out"?

I mean that, while number-literal includes it, we don't normatively require implementations to parse anything except integers (for digit size option in number.md). Implementations can, of course, interpret a number literal as a number (including exponentiation), but that's their business.

spec/functions/number.md Outdated Show resolved Hide resolved
spec/message.abnf Show resolved Hide resolved
@eemeli
Collaborator Author

eemeli commented Feb 12, 2025

As discussed and requested, I've re-included exponents in number-literal.

@aphillips
Member

@eemeli Ready for review?

@eemeli eemeli marked this pull request as ready for review February 12, 2025 19:56
Member

@aphillips aphillips left a comment

Very slight nit, otherwise LGTM

spec/syntax.md Outdated Show resolved Hide resolved
Co-authored-by: Addison Phillips <[email protected]>
@macchiati
Member

It'd be nice to get this merged so I can have the other literal PR based off of head.

@aphillips
Member

Based on previous discussion in the 2025-02-10 call, merging this change.

@aphillips aphillips merged commit 3e9cb6d into main Feb 12, 2025
2 checks passed
@aphillips aphillips deleted the free-the-literal branch February 12, 2025 21:16
Successfully merging this pull request may close these issues.

[FEEDBACK] Message Format Unquoted Literals
5 participants