Not all emojis work as bare keys #954
I did a quick check, and 179 emojis currently fail (the other 1530 work); here's a list: https://gist.github.com/arp242/a3b99e52c9dea2b6e2d6217aab490ad3 (that's based on Unicode 14, not 15, so there may be a few more – I need to update my tool to 15). Also, the variation selectors (U+FE0F in the example above) are a right pain; these are pretty much invisible in most editors. These should be excluded together with all the RTL stuff (which already are). |
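To see why variation selectors are easy to miss, it can help to print the individual code points of a key. The following is only a sketch: `describe_code_points` is a hypothetical helper, and the sample string (a heart followed by U+FE0F) is an assumed example, not the one from the original report.

```python
import unicodedata


def describe_code_points(text: str) -> list[str]:
    """Return one 'U+XXXX NAME' entry per code point in text."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
        for ch in text
    ]


# Assumed sample: HEAVY BLACK HEART followed by an invisible variation selector.
for entry in describe_code_points("\u2764\ufe0f"):
    print(entry)
```

Most editors render this string as a single glyph, but the listing shows two code points, the second being VARIATION SELECTOR-16.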
You use ZWJ for creating the emoji. While this is fine, the overlaid code point itself is not in the proper range. We’ve looked at more complex ranges, but decided against it for the added complexity it brings. There will always be some ranges of code points people may feel are missing. Using ZWJ you can ‘invent’ emojis or other characters. I agree it is somewhat unfortunate that certain combinations are currently not possible. But keep in mind that we’re talking about code point ranges, not about characters. And what you’re describing is allowing certain characters, which is an avenue we’re trying to avoid. |
That said, it’s possibly an oversight, as I don’t see anything in 2600-26FF that need be illegal. We’d have to look a little bit closer to the wider range you mention and the discussion or commit log to find out whether we did this deliberately (and then reassess whether that conclusion is still valid) or it was an honest mistake in the added ranges. I tried to be meticulous, but hey, we’re only human ;). Keep in mind that there’s also the argument that we don’t want to over-complicate the ranges. We try to be inclusive, and mainly ban ‘unsuitable’ ranges, while including the rest. |
Yes, the current check I need to do is:
Which doesn't exactly fill me with joy. But I'd rather have one somewhat ugly "wtf?!" function rather than silly stuff like "😗 works but |
I found out what the original motivation was: Basically, we accept letter-like code points. Dingbats, mathematical symbols and box drawing code points aren’t ‘letter-like’. Neither are emojis, of course. But the ranges of emojis that are allowed have been added to later versions of Unicode and belong to “other languages that weren’t previously assigned”. As such, they belong to “be liberal in what to accept from future versions of Unicode”. Perhaps the right course of action would’ve been to exclude other non letter-like ranges from later versions. However, that brought about another downside: that ID tokens in HTML and XML would not be valid unquoted names. Several RFCs overlap with the current definition. While this is not necessarily a goal for TOML, it has its benefits. With the arguments in the mentioned thread, I still think we’re on the right track here, using the ‘letter-like’ definition of the most widely implemented and used Unicode version (I believe that’s 5 or 6, at least .NET Framework and Windows prior to v11 (or v10?) use 5.x.). Of course, we could allow more tokens that aren’t allowed elsewhere. Or disallow more tokens that aren’t disallowed elsewhere. This would take us further away from widely established identifier definitions, but we may choose to go down that path. |
TOML already allows almost everything as quoted keys, so I think this doesn't matter at all. Directly using TOML keys in HTML, XML, or pretty much anywhere else without processing is already something you can't do. Looking at some other environments, there doesn't seem that much consensus in the first place:
Not sure what other languages/formats support Unicode identifiers off the top of my head. Going back to basics, the goals I'd set would be:
In that sense, "support emojis" is out of scope IMO; I don't think it would be horrible to lose support for it especially since you can still use them inside quoted keys. BUT having ~90% of the emojis work fine and ~10% not work is a bug IMO, especially since an emoji is explicitly included as an example. It's probably better to allow too little and then expand on that later if there's a demand for it. Once we allow something we can never take it back because that would break compatibility. And there's also #941; we need something for that to address "minimize potential for confusion", and tweaking (i.e. limiting) the set of allowed codepoints is one possible way to address that. |
TBH, it looks like we should align with Unicode TR31 syntax, rather than trying to come up with something else. It's what Go, Rust and Python seem to be doing (IIUC), and I think that might just be a more "obvious" way to achieve what we want to achieve here. |
The only possible problems I see in this range are the Eight Trigrams (☰ ☱ ☲ ☳ ☴ ☵ ☶ ☷) and various symbols related to yin and yang (⚊ ⚋ ⚌ ⚍ ⚎ ⚏). Some of these, especially ⚌, look very much like the equals sign (=), so it might be a good idea to avoid them in unquoted keys to prevent possible confusion. One idea: shorten the forbidden range to U+2630 (☰) through U+268F (⚏). Unfortunately, this shorter range still contains some very popular symbols (e.g. ☺ ♀ ♂) whose exclusion could remain confusing. Another idea: the Eight Trigrams are probably OK, since they all have three lines rather than the two of the equals sign. Two of the yin and yang symbols (⚊ ⚋) should be fine too, since they look similar to the underscore, which is already allowed. So we could just forbid U+268C to U+268F (⚌ ⚍ ⚎ ⚏), allowing everything else in that range. |
We previously decided against it, because it’s complex and, iirc, relies on categories. It’s likely (but I’d have to check) that it isn’t compatible with what we currently have (apart from already allowing starting with a digit). Edit: the TR31 set is very disjoint to what we have:
The two biggest issues: it uses categories, and it is letter-like only. Miscellaneous Symbols, which we explicitly include, are forbidden. Also, categories are dependent on the Unicode version, which we try to avoid. The full range, for any supported Unicode version, is rather complex. In the previous thread, there’s a comment that shows how complex, and we all kinda sighed with relief that in the end it wasn’t necessary to go that route. |
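A rough way to get a feel for how a TR31-style definition differs from the current ranges is Python's `str.isidentifier()`, which follows Python's own identifier rules (built on XID_Start/XID_Continue, plus the underscore). This is only an approximation of TR31 as it might apply to TOML, and the sample keys are made up for illustration.

```python
def tr31_like(key: str) -> bool:
    """Approximate a TR31-style check using Python's identifier rules."""
    return key.isidentifier()


# Hypothetical sample keys: letters pass, symbols and leading digits do not.
for key in ["café", "π", "1abc", "☺", "☰"]:
    print(repr(key), tr31_like(key))
```

Letters from any script are accepted, while Miscellaneous Symbols such as ☺ and ☰, and keys that start with a digit, are rejected.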
If we really want to include this range, we can do the same as we did for the Greek Question Mark (which looks like a semi colon) and just forbid only the Yin & Yang sign that looks like the equal sign.
That’s a good point.
Agreed. Which is what we support. Miscellaneous Symbols do not fit that description, but I see your other point that it’s a little confusing that some ranges are currently excluded. I’m not against including it (the range in the OP). However, I’m a little afraid that every few months we’re going to open this up again because some person’s favourite symbol isn’t allowed. I may be wrong about this, of course; perhaps this is the ‘last missing range’. We’ve spent many months coming to the current range; at some point we’d just have to settle and call it a day ;). |
@abelbraaksma: Yeah, just forbidding |
@abelbraaksma @ChristianSi I would much prefer that we include the Miscellaneous Symbols block but exclude the two-line yin and yang symbols U+268C to U+268F, as previously suggested, due to their resemblance to the equals sign:

    unquoted-key-char =/ %x2600-268B / %x2690-26FF ; include Miscellaneous Symbols, but exclude symbols resembling an equals sign
|
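As a sketch of what the suggested ABNF addition would mean for a parser, the two ranges translate into a simple code point test. The function name `in_proposed_misc_symbols` is hypothetical; only the range boundaries come from the rule above.

```python
def in_proposed_misc_symbols(cp: int) -> bool:
    """True if cp falls in %x2600-268B or %x2690-26FF from the suggested rule."""
    return 0x2600 <= cp <= 0x268B or 0x2690 <= cp <= 0x26FF


# The smiley is inside the allowed part; the equals-like digram is excluded.
print(in_proposed_misc_symbols(ord("☺")))
print(in_proposed_misc_symbols(ord("⚌")))
```

The gap between the two ranges is exactly U+268C to U+268F, the four two-line symbols that resemble an equals sign.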
There are some other syntax-like homographs too:
That's from a quick visual inspection; not a full list. There's some more in the "Halfwidth and Fullwidth Forms" and "Small Form Variants" blocks in particular. |
That’s an interesting list, but I don’t think we should try to be exhaustive here. There’ll always be certain glyphs that look confusing. Put in ZWJ and you can create any glyph from smaller components. |
Maybe we shouldn't allow ZWJ? I've been going back-and-forth on what to do about all of this. While the original issue is "Not all emojis work as bare keys", this ties in to other issues as well and there are knock-on effects. We already allow almost everything as quoted keys. In hindsight, I think this was a mistake, but we can't change that now, and people don't use quoted keys that much since it's annoying to type (many TOML users probably don't even know you can use it) so it's less of an issue in the real world. With bare keys, people will actually start using all the stuff that's allowed. I'm worried that allowing too much will lead to confusion. Homoglyphs are actually not something I'm very worried about since no reasonable person would use "# trollolol = 1" (U+FF03, not a "real" hash) other than maybe as a practical joke on your coworkers. No one really enters these things by accident. Other things that are explicitly excluded now like the multiplication sign (×) isn't that much of an issue either; it's very similar to the letter "x", but no one enters "×" by accident when they intended to write "x". I think it's fine to allow TOML users to do "stupid things", and it's okay to rely on TOML users being reasonably sane. What I am worried about are "invisible" characters such as ZWJ, variation selectors, combining characters, and things like that. All of this is very non-obvious, and easy to get confused by, even for people well versed in how all of this works (i.e. you and me). So while "# trollolol = 1" is certainly confusing, it's not really an issue that crops up in the real world. Same with U+268C-U+268F. I think this is almost a philosophical issue: "if a tree in a forest is confusing but no one sees the tree being confusing, then is it really confusing?" 
So, back to ZWJ: if we disallow ZWJ lots of emojis won't work, and to be consistent we'd have to disallow at least the commonly used emojis like 😂 and whatnot, which would make the codepoint range a bit more complex. However, in general, I'd say:
So I'd say we probably shouldn't allow ZWJ, and variation selectors, and combining characters, and perhaps a few other things. Those are things that will lead to confusion, unlike ⚌, #, and whatnot. I don't actually care all that much about those because I don't expect anyone will be confused by it in real-world scenarios. |
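To make the "invisible characters" concern concrete, here is a minimal sketch that flags joiners, variation selectors and combining marks in a key. The helper name, the labels and the exact set of code points checked are assumptions for illustration, not a proposal for the specification.

```python
import unicodedata

ZWNJ = 0x200C  # ZERO WIDTH NON-JOINER
ZWJ = 0x200D   # ZERO WIDTH JOINER


def invisible_code_points(key: str) -> list[tuple[int, str]]:
    """Return (index, label) for each hard-to-see code point in key."""
    found = []
    for index, ch in enumerate(key):
        cp = ord(ch)
        if cp in (ZWNJ, ZWJ):
            found.append((index, "joiner"))
        elif 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF:
            # Checked before the category test, since variation
            # selectors are themselves nonspacing marks.
            found.append((index, "variation selector"))
        elif unicodedata.category(ch).startswith("M"):
            found.append((index, "combining mark"))
    return found


print(invisible_code_points("e\u0301"))
print(invisible_code_points("\u2764\ufe0f"))
```

A check like this could back a linter warning rather than a hard parse error, which is one way to reduce confusion without changing what the grammar accepts.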
ZWJ is used in many scripts to create valid characters, glyphs and words. It’s not exclusive to emojis. I don’t think having it is an issue. More the opposite. The whole idea here is to be inclusive wrt languages and scripts. The side effect of this approach is that some emojis also work, because they are in codepoint ranges not explicitly excluded, mainly because these ranges weren’t assigned to in older versions of Unicode. By and large this should be fine. Identifiers will typically be expressed in someone’s native language, script, or a common language like English, Arabic or Spanish. The need for dingbats or emojis is likely comparatively small. I’ve no problem keeping the status quo, or adding other ranges, but whatever we do, there’ll always be new codepoints assigned and they may or may not contain non letter-like characters. These will always be in the already allowed ranges and therefore we cannot exclude them pre-emptively. |
I'm fine with including this range except for U+268C to U+268F ( Also I urge not to reopen the rest of the discussion about the allowed ranges. We have found a solution that allows unquoted keys in essentially any script, without burdening implementors with too much complexity. That's good, so we should just keep it that way! |
You're right, I should have addressed that. TR31 has quite a bit of special handling for it, and the way I read it even allows excluding it. Go and C# outright disallow using it. I can see how allowing ZWJ makes sense. It's not entirely clear to me if it's needed to correctly write these languages though, or if it's optional. My thinking is "better to include too little and correct that if need be". Variation selectors are still an issue though. I don't think there's any good reason to include them, they are commonly inserted, and very invisible. And combining characters introduce a lot of ambiguity in string equivalence, as brought up in #941. The more I think about it, the more I feel we should do our best to reduce the potential for ambiguity as this would at least reduce potential for confusion, and the need for NFC normalisation and a Unicode library (like ICU). Perhaps we can't entirely eliminate it 100%, but just covering the common cases would already go a long way.
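The string-equivalence ambiguity mentioned above can be shown with the standard `unicodedata` module. This is a small sketch; `same_key_after_nfc` is a hypothetical helper and the two spellings of "café" are assumed sample keys.

```python
import unicodedata


def same_key_after_nfc(a: str, b: str) -> bool:
    """Compare two keys after NFC normalisation."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


composed = "caf\u00e9"     # é as one precomposed code point
decomposed = "cafe\u0301"  # e followed by COMBINING ACUTE ACCENT

print(composed == decomposed)                    # raw comparison
print(same_key_after_nfc(composed, decomposed))  # comparison after NFC
```

Both strings look identical on screen, but they only compare equal once normalised, so a parser without normalisation would treat them as two different keys.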
None of these specific issues were brought up before, as far as I've seen. You can disagree it's an issue, and that is of course fine, but I'd never dismiss anyone like that. |
@arp242: Unicode normalization issues are irrelevant here, since they apply to quoted and unquoted keys in exactly the same way. In quoted keys, arbitrary Unicode is allowed and, of course, that's not going away; in fact, it can't go away since that would break backward compatibility. And let's not roll back on our promise that "you can use unquoted keys representing words from arbitrary languages", which we have realized in the current state. In this regard, I found Some trivial knowledge about Unicode a good read. My take from there: variation selectors are needed, at least, to write Mongolian correctly. In that article, the usage of ZWJ is mostly limited to emojis, but from Wikipedia I get that it is needed to render text in various scripts (e.g. Arabic or Indic) correctly. You're right that the text will likely still be readable without this information (I guess?) but it'll look "broken" to people. Also relevant is that text editors will likely auto-insert these ZWJs where needed. So when we tell people "you can use bare keys in Arabic script, but only without ZWJs", this might well cause all kinds of parsing errors, since people will have a hard time writing keys in these scripts without this character appearing. So yes, while you're right that it makes sense to discuss whether this character and the variation selector code block should remain included, I'd still tend to say that yes, they should. |
I agree, they should. Better to be liberal in what you accept, esp when it comes to scripts in Unicode.
Wrt C#, this is only partially true. Identifiers in Common IL can be any codepoint, except a small handful, like NULL and FFEF, I believe. In F#, this rule is applied very liberally, and you can create identifiers in the full range Common IL allows. In C#, calling such identifiers requires a little extra work, but is still possible. Let's not start limiting more. Either expand the ranges, or leave it as is. From the discussion above, I think the conclusion would lean towards inclusion of the extra range, as mentioned here: #954 (comment) |
That is correct, but as I mentioned before quoted keys aren't used all that much, so practically it's much less of an issue with quoted keys. That we need be a bit more careful with bare keys is not controversial, otherwise we would just allow everything except
Yeah, maybe; it's really hard for me to judge to what degree it's "needed" and "commonly used" and to what degree it's "a feature offered, but not commonly used". I loaded the Arabic Wikipedia on Mars (just a random featured/long article), and it seems to contain only a single ZWJ (in the That said, Go != TOML and the context for both is different, and TR-31 contains special rules for handling ZWJ, and I suppose this is an argument for both sides here: "ZWJ is needed in some contexts, so it must be allowed" as well as "ZWJ can be confusing, so we need to restrict where it can appear". In conclusion: further research needed if it's decided to spend effort on this in the first place. The same applies to variation selectors.
Yeah, I don't really agree with that. "Postel's law" has been widely criticized over the years and I'm hardly the first/only to disagree with it; I'd say it's fair to state that it's pretty controversial overall. It was framed in a very different context, and in a very different world; historically it made a bit more sense due to standards often written up after the implementations, unclear/underspecified standards, harder to actually read the standards so many didn't, "cowboy coding" being the norm, etc. much of that applies a lot less today, and IMHO it doesn't apply to TOML or Unicode. But my main issue with this is: it doesn't really engage with my concern, which can be summarized as "I feel this has the potential to cause a great deal of confusion, so I think it's better to be conservative initially, and perhaps correct it later if need be". If you want to say "I don't think people will end up being confused" or "I think it's an okay trade-off that people will get confused" then fair enough, as that engages with the stated concerns. But this doesn't really.
To be honest, I'd really like some other views on this as well; thus far only three people have commented on this. I realize you might think I'm stubborn and difficult here, but I promise you I'm really not trying to be. I spent a lot of time looking at this over the last few days (which also included considering "is it really worth everyone's time and energy banging on about this?"), and I think this has a huge potential to bite us and people using TOML in the ass. If it was only a matter of "I think doing it like this is nicer" or "I don't like it" I wouldn't have cared too much; I don't like to bikeshed over details and generally "whatever works as long as it's not completely atrocious" is fine with me. Either way, I probably said everything I wanted to say, so I'll leave it at that for a while, giving other people a chance to catch up, reply, vote, etc. |
Posting as a separate comment for votes, I think the core questions are essentially:
Point "2" has a lot of subpoints, but if the answer to "1" is a "no" then it's pointless to even discuss it. People can vote on this comment (not using thumbs to avoid ambiguity):
|
Yeah, but that sword has two edges: it’s similarly confusing if certain names cannot be expressed. If someone wants to use ZWJ, they will typically know what they’re doing. The majority of people will stay away from it, simply because it’s never come up with naming identifiers. Quote:
|
I believe this would greatly improve things and solves all the issues, mostly. It's a bit more complex, but not overly so, and can be implemented without a Unicode library without too much effort. It offers a good middle ground, IMHO.

I don't think there are ANY perfect solutions here and ANY solution is a trade-off. That said, I do believe some trade-offs are better than others, and after looking at a bunch of different options I believe this is by far the best path for TOML.

Advantages:

- This is what I would consider the "minimal set" of characters we need to add for reasonable international support, meaning we can't really make a mistake with this by accidentally allowing too much.

  We can add new ranges in TOML 1.2 (or even change the entire approach, although I'd be very surprised if we need to), based on actual real-world feedback, but any approach we will take will need to include letters and digits from all scripts.

  This is the strongest argument in favour of this and the biggest improvement: we can't really do anything wrong here in a way that we can't correct later. Being conservative is probably the right way forward.

- This solves the normalisation issues, since combining characters are no longer allowed in bare keys, so it becomes a moot point.

  For quoted keys normalisation is mostly a non-issue because few people use them and the specification even strongly discourages people from using them, which is why this has gone largely unnoticed and undiscussed before the "Unicode in bare keys" PR was merged.[1]

- It's consistent in what we allow: no "this character is allowed, but this very similar other thing isn't, what gives?!"

  Note that toml-lang#954 was NOT about "I want all emojis to work", but "this character works fine, but this very similar doesn't". This shows up in a number of things:

      a.toml:
          Input: ; = 42 # U+037E GREEK QUESTION MARK (Other_Punctuation)
          Error: line 1: expected '.' or '=', but got ';' instead

      b.toml:
          Input: · = 42 # U+0387 GREEK ANO TELEIA (Other_Punctuation)
          Error: (none)

      c.toml:
          Input: – = 42 # U+2013 EN DASH (Dash_Punctuation)
          Error: line 1: expected '.' or '=', but got '–' instead

      d.toml:
          Input: ⁻ = 42 # U+207B SUPERSCRIPT MINUS (Math_Symbol)
          Error: (none)

      e.toml:
          Input: #x = "commented ... or is it?" # U+FF03 FULLWIDTH NUMBER SIGN (Other_Punctuation)
          Error: (none)

  "Some punctuation is allowed but some isn't" is hard to explain, and also not what the specification says: "Punctuation, spaces, arrows, box drawing and private use characters are not allowed." In reality, a lot of punctuation IS allowed, but not all.

  People don't read specifications, nor should they. People try something and see if it works. Now it seems to work on first approximation, and then (possibly months later) it seems to "break".

  From the user's perspective this seems like a bug in the TOML parser. There is no good way to communicate this other than "these codepoints, which cover most of what you'd write in a sentence, except when it doesn't".

  In contrast, "we allow letters and digits" is simple to spec, simple to communicate, and should have a minimum potential for confusion. The current spec disallows some things seemingly almost arbitrarily while allowing other very similar characters.

- This avoids a long list of confusable special TOML characters; some were mentioned above but there are many more:

      '#' U+FF03 FULLWIDTH NUMBER SIGN (Other_Punctuation)
      '"' U+FF02 FULLWIDTH QUOTATION MARK (Other_Punctuation)
      '﹟' U+FE5F SMALL NUMBER SIGN (Other_Punctuation)
      '﹦' U+FE66 SMALL EQUALS SIGN (Math_Symbol)
      '﹐' U+FE50 SMALL COMMA (Other_Punctuation)
      '︲' U+FE32 PRESENTATION FORM FOR VERTICAL EN DASH (Dash_Punctuation)
      '˝' U+02DD DOUBLE ACUTE ACCENT (Modifier_Symbol)
      '՚' U+055A ARMENIAN APOSTROPHE (Other_Punctuation)
      '܂' U+0702 SYRIAC SUBLINEAR FULL STOP (Other_Punctuation)
      'ᱹ' U+1C79 OL CHIKI GAAHLAA TTUDDAAG (Modifier_Letter)
      '₌' U+208C SUBSCRIPT EQUALS SIGN (Math_Symbol)
      '⹀' U+2E40 DOUBLE HYPHEN (Dash_Punctuation)
      '࠰' U+0830 SAMARITAN PUNCTUATION NEQUDAA (Other_Punctuation)

  Is this a big problem? I guess it depends; I can certainly imagine an Armenian speaker accidentally leaving an Armenian apostrophe.

- Maps to identifiers in more (though not all) languages. We discussed whether TOML keys are "strings" or "identifiers" last week in toml-lang#966 and while views differ (mostly because they're both) it seems to me that making it map *closer* is better. This is a minor issue, but it's nice.

That does not mean it's perfect; as I mentioned all solutions come with a trade-off. The ones made here are:

- The biggest issue by far is that the check to see if a character is valid may become more complex for some languages and environments that can't rely on a Unicode database being present.

  However, implementing this check is trivial logic-wise: it just needs to loop over every character and check if it's in a range table.

  The downside is it needs a somewhat large-ish "allowed characters" table with 716 start/stop ranges, which is not ideal, but entirely doable and easily auto-generated. It's ~164 lines hard-wrapped at column 80 (or ~111 lines hard-wrapped at col 120). tomlc99 is 2,387 lines, so that seems within the limits of reason (actually, reading through the code, adding multibyte support in the first place will probably be harder, with this range table being a minor part).

- There's a new Unicode version roughly every year or so, and the way it's written now means it's "locked" to Unicode 9 or, optionally, a later version. This is probably fine: Apple's APFS filesystem (which does normalisation) is "locked" to Unicode 9.0; HFS+ was Unicode 3.2. Go is Unicode 8.0. etc. I don't think this is really much of an issue in practice.

  I chose Unicode 9 as everyone supports this; I hesitated over it for a long time, and we can also use a more recent version. I feel this gives us a nice balance between reasonable interoperability while also future-proofing things.

- ABNF doesn't support Unicode. This is a tooling issue, and in my opinion the tooling should adjust to how we want TOML to look, rather than adjusting TOML to what tooling supports. AFAIK no one uses the ABNF directly in code, and it's merely "informational".

  I'm not happy with this, but personally I think this should be a non-issue when considering what to do here. We're not the only people running into this limitation, and it is really something that IETF should address in a new RFC or something ("Extra Augmented BNF?").

Another solution I tried is restricting the code ranges; I twice tried to do this (with some months in-between) and spent a long time looking at Unicode blocks and ranges, and I found this impractical: we'll end up with a long list which isn't all that different from what this proposal adds.

Fixes toml-lang#954
Fixes toml-lang#966
Fixes toml-lang#979
Ref toml-lang#687
Ref toml-lang#891
Ref toml-lang#941

[1]: Aside: I encountered this just the other day as I created a TOML file with all UK election results since 1945, which looks like:

         [1950]
         Labour       = [13_266_176, 315, 617]
         Conservative = [12_492_404, 298, 619]
         Liberal      = [ 2_621_487,   9, 475]
         Sinn_Fein    = [    23_362,   0,   2]

     That should be Sinn_Féin, but "Sinn_Féin" seemed ugly, so I just wrote it as Sinn_Fein. This is what most people seem to do.
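The "loop over every character and check if it's in a range table" approach described in the commit message could look roughly like the sketch below. The table here is a tiny hand-picked excerpt for illustration only, NOT the 716-range table from the proposal, and the names `ALLOWED_RANGES`, `is_allowed` and `is_bare_key` are hypothetical.

```python
from bisect import bisect_right

# Illustrative excerpt only; a real table would be auto-generated.
# Sorted, non-overlapping (start, end) pairs, both ends inclusive.
ALLOWED_RANGES = [
    (0x2D, 0x2D),      # '-' (already allowed in bare keys)
    (0x30, 0x39),      # ASCII digits
    (0x41, 0x5A),      # A-Z
    (0x5F, 0x5F),      # '_' (already allowed in bare keys)
    (0x61, 0x7A),      # a-z
    (0xC0, 0xD6),      # Latin-1 letters before the multiplication sign
    (0xD8, 0xF6),      # Latin-1 letters before the division sign
    (0xF8, 0x2C1),     # more Latin letters and modifier letters
    (0x391, 0x3A1),    # Greek capital letters
    (0x3A3, 0x3F5),    # more Greek letters
]
_STARTS = [start for start, _ in ALLOWED_RANGES]


def is_allowed(cp: int) -> bool:
    """Binary-search the table for a range that contains cp."""
    i = bisect_right(_STARTS, cp) - 1
    return i >= 0 and cp <= ALLOWED_RANGES[i][1]


def is_bare_key(key: str) -> bool:
    """True if key is non-empty and every code point is in the table."""
    return key != "" and all(is_allowed(ord(ch)) for ch in key)


print(is_bare_key("Sinn_F\u00e9in"))  # accented letter
print(is_bare_key("\u037e"))          # GREEK QUESTION MARK
```

No Unicode database is needed at run time; the only version-dependent part is the generated table, which matches the "locked to a Unicode version" trade-off noted in the message.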
I believe this would greatly improve things and solves all the issues, mostly. It's a bit more complex, but not overly so, and can be implemented without a Unicode library without too much effort. It offers a good middle ground, IMHO.

I don't think there are ANY perfect solutions here; *anything* will be a trade-off. That said, I do believe some trade-offs are better than others, and I've made it no secret that I feel the current trade-off is a bad one. After looking at a bunch of different options I believe this is by far the best path for TOML.

Advantages:

- This is what I would consider the "minimal set" of characters we need to add for reasonable international support, meaning we can't really make a mistake with this by accidentally allowing too much.

  We can add new ranges in TOML 1.2 (or even change the entire approach, although I'd be very surprised if we need to), based on actual real-world feedback, but any approach we take will need to include letters and digits from all scripts.

  This is a strong argument in favour of this and a huge improvement: we can't really do anything wrong here in a way that we can't correct later, unlike what we have now, which is "well I think it probably won't cause any problems, based on what these 5 European/American guys think, but if it does: we won't be able to correct it".

  Being conservative for this type of thing is good!

- This solves the normalisation issues, since combining characters are no longer allowed in bare keys, so it becomes a moot point.

  For quoted keys normalisation is mostly a non-issue because few people use them, which is why this has gone largely unnoticed and undiscussed before the "Unicode in bare keys" PR was merged.[1]

- It's consistent in what we allow: no "this character is allowed, but this very similar other thing isn't, what gives?!"

  Note that toml-lang#954 was NOT about "I want all emojis to work" per se, but "this character works fine, but this very similar one doesn't".

  This shows up in a number of things aside from emojis:

      a.toml:
        Input: ; = 42 # U+037E GREEK QUESTION MARK (Other_Punctuation)
        Error: line 1: expected '.' or '=', but got ';' instead

      b.toml:
        Input: · = 42 # U+0387 GREEK ANO TELEIA (Other_Punctuation)
        Error: (none)

      c.toml:
        Input: – = 42 # U+2013 EN DASH (Dash_Punctuation)
        Error: line 1: expected '.' or '=', but got '–' instead

      d.toml:
        Input: ⁻ = 42 # U+207B SUPERSCRIPT MINUS (Math_Symbol)
        Error: (none)

      e.toml:
        Input: ＃x = "commented ... or is it?" # U+FF03 FULLWIDTH NUMBER SIGN (Other_Punctuation)
        Error: (none)

  "Some punctuation is allowed but some isn't" is hard to explain, and also not what the specification says: "Punctuation, spaces, arrows, box drawing and private use characters are not allowed." In reality, a lot of punctuation IS allowed, but not all (especially outside of the Latin character range by the way, which shows the Euro/US bias in how it's written).

  People don't read specifications in great detail, nor should they. People try something and see if it works. Now it seems to work on first approximation, and then (possibly months or years later) it seems to "suddenly break". From the user's perspective this seems like a bug in the TOML parser, but it's not: it's a bug in the specification. It should either allow everything or nothing. This in-between is confusing and horrible.

  There is no good way to communicate this other than "these codepoints, which cover most of what you'd write in a sentence, except when it doesn't".

  In contrast, "we allow letters and digits" is simple to spec, simple to communicate, and should have a minimum potential for confusion. The current spec disallows some things seemingly almost arbitrarily while allowing other very similar characters.

- This avoids a long list of confusable special TOML characters; some were mentioned above but there are many more:

      '＃' U+FF03 FULLWIDTH NUMBER SIGN (Other_Punctuation)
      '＂' U+FF02 FULLWIDTH QUOTATION MARK (Other_Punctuation)
      '﹟' U+FE5F SMALL NUMBER SIGN (Other_Punctuation)
      '﹦' U+FE66 SMALL EQUALS SIGN (Math_Symbol)
      '﹐' U+FE50 SMALL COMMA (Other_Punctuation)
      '︲' U+FE32 PRESENTATION FORM FOR VERTICAL EN DASH (Dash_Punctuation)
      '˝' U+02DD DOUBLE ACUTE ACCENT (Modifier_Symbol)
      '՚' U+055A ARMENIAN APOSTROPHE (Other_Punctuation)
      '܂' U+0702 SYRIAC SUBLINEAR FULL STOP (Other_Punctuation)
      'ᱹ' U+1C79 OL CHIKI GAAHLAA TTUDDAAG (Modifier_Letter)
      '₌' U+208C SUBSCRIPT EQUALS SIGN (Math_Symbol)
      '⹀' U+2E40 DOUBLE HYPHEN (Dash_Punctuation)
      '࠰' U+0830 SAMARITAN PUNCTUATION NEQUDAA (Other_Punctuation)

  Is this a big problem? I guess it depends; I can certainly imagine an Armenian speaker accidentally leaving an Armenian apostrophe.

  Confusables are also an issue with different scripts (Latin and Cyrillic are well-known), but this is less of an issue since it's not syntax, and also something that's fundamentally unavoidable in any multi-script environment.

- Maps closer to identifiers in more (though not all) languages. We discussed whether TOML keys are "strings" or "identifiers" last week in toml-lang#966 and while views differ (mostly because they're both) it seems to me that making it map *closer* is better. This is a minor issue, but it's nice.

That does not mean it's perfect; as I mentioned all solutions come with a trade-off. The ones made here are:

- The biggest issue by far is that the check to see if a character is valid may become more complex for some languages and environments that can't rely on a Unicode database being present.

  However, implementing this check is trivial logic-wise: it just needs to loop over every character and check if it's in a range table. You already need this with TOML 1.0, it's just that the range tables become larger.

  The downside is it needs a somewhat large-ish "allowed characters" table with 716 start/stop ranges, which is not ideal, but entirely doable and easily auto-generated. It's ~164 lines hard-wrapped at column 80 (or ~111 lines hard-wrapped at col 120). tomlc99 is 2,387 lines, so that seems within the limits of reason (actually, reading through the tomlc99 code, adding multibyte support at all will be the harder part, with this range table being a minor part).

- There's a new Unicode version roughly every year or so, and the way it's written now means it's "locked" to Unicode 9 or, optionally, a later version. This is probably fine: Apple's APFS filesystem (which does normalisation) is "locked" to Unicode 9.0; HFS+ was Unicode 3.2. Go is Unicode 8.0. etc. I don't think this is really much of an issue in practice.

  I chose Unicode 9 as everyone supports this; I went back and forth on it for a long time, and we can also use a more recent version. I feel this gives us a nice balance between reasonable interoperability and future-proofing.

- ABNF doesn't support Unicode. This is a tooling issue, and in my opinion the tooling should adjust to how we want TOML to look, rather than adjusting TOML to what tooling supports. AFAIK no one uses the ABNF directly in code, and it's merely "informational".

  I'm not happy with this, but personally I think this should be a non-issue when considering what to do here. We're not the only people running into this limitation, and it is really something that IETF should address in a new RFC or something ("Extra Augmented BNF"?)

Another solution I tried is restricting the code ranges; I twice tried to do this (with some months in-between) and spent a long time looking at Unicode blocks and ranges, and I found this impractical: we'll end up with a long list which isn't all that different from what this proposal adds.

Fixes toml-lang#954
Fixes toml-lang#966
Fixes toml-lang#979
Ref toml-lang#687
Ref toml-lang#891
Ref toml-lang#941

---

[1]: Aside: I encountered this just the other day as I created a TOML file with all UK election results since 1945, which looks like:

         [1950]
         Labour       = [13_266_176, 315, 617]
         Conservative = [12_492_404, 298, 619]
         Liberal      = [ 2_621_487,   9, 475]
         Sinn_Fein    = [    23_362,   0,   2]

     That should be Sinn_Féin, but "Sinn_Féin" seemed ugly, so I just wrote it as Sinn_Fein. This is what most people seem to do.
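The "loop over every character and check if it's in a range table" step from the commit message can be sketched roughly as follows. This is only an illustration, not the proposal's actual table: it assumes "letters and digits" means Unicode general categories L* and Nd plus ASCII `-` and `_`, and it generates the table from whatever Unicode version the local Python ships, so the number of ranges need not be 716. The names `build_ranges`, `in_table` and `is_bare_key` are made up for this example.

```python
import bisect
import unicodedata


def build_ranges():
    """Auto-generate inclusive (start, stop) ranges of letters and digits.

    Assumption: "letters and digits" = general categories L* and Nd.
    """
    ranges = []
    start = None
    for cp in range(0x110000):
        cat = unicodedata.category(chr(cp))
        ok = cat.startswith("L") or cat == "Nd"
        if ok and start is None:
            start = cp                      # a new run of allowed code points begins
        elif not ok and start is not None:
            ranges.append((start, cp - 1))  # the run ended at the previous code point
            start = None
    if start is not None:
        ranges.append((start, 0x10FFFF))
    return ranges


RANGES = build_ranges()
STARTS = [start for start, _ in RANGES]


def in_table(ch):
    """Binary-search the sorted range table for a single character."""
    cp = ord(ch)
    i = bisect.bisect_right(STARTS, cp) - 1
    return i >= 0 and cp <= RANGES[i][1]


def is_bare_key(key):
    """A bare key is non-empty and made of table characters, '-' or '_'."""
    return key != "" and all(in_table(c) or c in "-_" for c in key)


if __name__ == "__main__":
    print(len(RANGES), "ranges")
    print(is_bare_key("Sinn_F\u00e9in"))    # precomposed é (a letter)
    print(is_bare_key("Sinn_Fe\u0301in"))   # e + combining acute (a mark)
    print(is_bare_key("\uff03x"))           # FULLWIDTH NUMBER SIGN
```

A parser without a Unicode database would ship the generated table as a static array and keep only the `in_table` lookup. Because combining marks are not letters, the decomposed spelling of "é" is rejected outright, which is the sense in which the normalisation question goes away for bare keys.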
If this is one of the remaining blockers for 1.1.0-rc0, what if we instead defer bare keys to 1.2? |
At this point, I'm inclined to agree, and to make extending the allowable bare keys a primary objective for the future TOML 1.2.0. No offense to all the hard work put forward to make this viable, but while @pradyunsg is still MIA, we should slim down for now to get him back here for a while, work out how to proceed with the standards project, and get 1.1.0-rc1 out the door. So let's save this issue, and all the other open issues regarding the extension of bare keys, for after 1.1.0 is released, then hit it full-bore with the best solution we can devise, with a scheduled date for release and a dedicated core team surrounding the standard and dealing with day-to-day matters. Sorry @ChristianSi, I know this is a bitter pill to swallow, but we've waited too long, and we know what else needs done right now. |
I don't really see any serious blockers, neither this nor anything else. But, as @eksortso has already mentioned, there hasn't been a working maintainer for the last few months (at least), so the project is effectively stuck. @eksortso If you have an idea on how to solve this, I'd be interested to hear it! |
@ChristianSi wasn't there an issue not so long ago to assign a new maintainer? @pradyunsg, would you allow another maintainer to the team? |
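I was writing test cases for this, and using a pirate flag (🏴‍☠️) doesn't work; this is U+1F3F4 (waving black flag), U+200D (ZWJ), U+2620 (skull and crossbones), and U+FE0F (variation selector-16).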
The flag and ZWJ is fine, but the skull and crossbones isn't allowed in the current range.
Seems confusing since most emojis work. Took me quite a bit of time to figure out when modifying my parser to support this because I just assumed I missed something, but it turns out it's just not in the allowed range.
Looking at the U+2500..U+2bff range, I don't really see why we need to skip a lot of these things.
I know we discussed this before, but I still think we should either allow only letters+numbers or just allow almost everything (with a few exceptions); the current behaviour is just confusing. The examples use an emoji and ZWJ is explicitly allowed, so you'd expect all emojis to work, but it turns out only some emojis work. It just so happened by chance that "pirate flag" was the first emoji I tried, but there are probably others as well, and with ZWJ combinations it'll be whack-a-mole.
Either way, IMHO we should support all emojis or none. Many other ZWJ combinations do work fine; 🏳️🌈 (U+1F3F3 ZWJ U+1F308) or 🏴 is okay, but 🏳️⚧️ isn't (as U+26A7 isn't in the allowed range). In a quick test it seems all flags work, except two.
Originally posted by @arp242 in #891 (comment)
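To find out which code point in an emoji sequence is the one being rejected, something like the following sketch can help. It models only the one skipped block named above (U+2500..U+2BFF) and is not the full set of ranges from the specification; `SKIPPED` and `rejected_code_points` are hypothetical names used for illustration.

```python
import unicodedata

# Assumption: only the skipped block mentioned in the issue text is modelled.
SKIPPED = [(0x2500, 0x2BFF)]


def rejected_code_points(key):
    """Return (U+XXXX, name) for every code point inside a skipped range."""
    out = []
    for ch in key:
        cp = ord(ch)
        if any(lo <= cp <= hi for lo, hi in SKIPPED):
            out.append((f"U+{cp:04X}", unicodedata.name(ch, "<unnamed>")))
    return out


# Written as escapes so the ZWJ and variation selectors stay visible.
pirate = "\U0001F3F4\u200D\u2620\uFE0F"        # flag, ZWJ, skull and crossbones, VS16
rainbow = "\U0001F3F3\uFE0F\u200D\U0001F308"   # flag, VS16, ZWJ, rainbow
trans = "\U0001F3F3\uFE0F\u200D\u26A7\uFE0F"   # flag, VS16, ZWJ, U+26A7, VS16

if __name__ == "__main__":
    for label, seq in [("pirate", pirate), ("rainbow", rainbow), ("trans", trans)]:
        print(label, rejected_code_points(seq))
```

Spelling the sequences as escapes sidesteps the problem that ZWJ and U+FE0F are invisible in most editors.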