[Proposal] more permissive definition of allowed symbol syntax #296
Comments
There's not supposed to be a difference in parsing, but the way symbols are rendered is an implementation detail. The line you point to in the Haskell code to establish a difference in parsing has nothing to do with parsing, that's part of the djot renderer. |
since i know nothing about Haskell, it was a guess tbo, but may i have a valid proposal apart from the code ? |
Actually, you're right that the Haskell implementation doesn't allow I do want to keep symbols "abstract," in the sense that no particular rendering is mandated for them (they can be treated in different ways by the renderer). But your main suggestion is to broaden the characters allowed in symbols, to make it easier to use them for things like I'm open to that, I suppose. Comments from others welcome. |
Like Lua and Js implementations. See jgm/djot#296.
yes that is what i would suggest, and having feedback from other sources both pro and con is appreciated. |
with pandoc in mind, i wouldnt mandate a particular rendering either. |
@jgm I wrote a JS filter which converts symbols looking like a decimal |
so you actually like my proposal ? because then the |
@terefang I like the idea but not the "syntax": we already have the colons so the full {HT,X}ML entity syntax is overkill and the I say allow any non-blank ASCII char other than Footnotes
|
hmm ... so it would be written all:
|
With my current filters The UI, my keyboard or my brain (probably the last since the colons aren't included in the string the filter deals with) omitted some colons in my previous post. Should hopefully be fixed now. |
FWIW |
@bpj your js might actually benefit from my proposal the most because of those icon libraries, isnt it so ? |
Not sure I get it all -- but in my current implementation (of a renderer), I use |
@terefang I'm mostly with @jgm: what to do with symbols should be up to filters/renderers but some more permitted characters between the colons to allow a bit of "Hungarian typing" among symbols would be welcome, or even some kind of actual namespaces for symbols, although that is probably overkill. Symbols were redefined from "emoji" to avoid hardcoding a table of alias—emoji mappings in the parser, which was a good thing to do. Not everyone uses emoji, or any of the other things you list, and the data structures needed for each of them are big, albeit not enormously big (although I have a script which generates a table mapping all non-Han Unicode names and aliases to chars, and that is enormous 😁 even though only a fraction of all possible Unicode chars are assigned yet!), so it is good design to let filters/renderers decide what to use symbols for, so that everyone can use symbols for what suits them and only that. The problem is that some naughty users may want to use symbols for more than one thing, making the possibility of some kind of (pseudo-) namespacing desirable. The set of characters allowed in symbols were originally determined by what gemoji aliases use, so it makes sense to extend the set of permitted characters when symbols are to be used for other things as well. It is easy to use regex or substring comparison to determine whether any given symbol has the right format for a given use case, so there is no need to hardcode anything on the parser side. The alias—emoji mappings are hardcoded in my filter by inlining the JSON, but that is mostly because reading files in JS isn't possible without third-party libraries. A more "serious" application than my filter can address that. |
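A minimal sketch of the "regex or substring comparison" point above, assuming a filter is handed the text between the colons and that the parser already lets such names through. The shapes (`#60`, `#x2014`, `fa-bars`) are taken from examples in this thread; the function name `classify_symbol` and the labels it returns are hypothetical.

```python
import re

# Hypothetical format checks: each use case claims symbols by shape alone,
# so nothing has to be hardcoded on the parser side.
DECIMAL = re.compile(r"#[0-9]+")
HEX = re.compile(r"#x[0-9A-Fa-f]+")


def classify_symbol(name: str) -> str:
    """Return a label for the use case a symbol name appears to belong to."""
    if DECIMAL.fullmatch(name) or HEX.fullmatch(name):
        return "codepoint"
    if name.startswith("fa-"):  # substring comparison acts as a pseudo-namespace
        return "icon"
    return "alias"  # anything else is left to an alias table, e.g. emoji
```

A filter that only cares about one use case can ignore every symbol whose label it does not handle.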
@Omikhleia Absolutely: my filter also recognises |
By the way, since I use pseudo-footnotes as symbol definitions in my rendering layer, in my own use case the following should theoretically work:
("theoretically" = I didn't try, as I don't use emojis -- but I know it works for |
how would you use the glyphs/icons from the "font-awesome" font ? my suggestion would be how would you separate icon-sets used in parallel ? (fontawesome, octoicons, material design)
|
@terefang Isn't |
my use case would be a pdf renderer; the implementation would need to infer from the symbol what is meant by the content author. i would assume that a content author using the symbol syntax would either be familiar with html-style entities or have registered a particular icon font in the rendering backend. tbo, i currently have a test-case setup where an icon font ("octoicons.svg") is registered under a prefix ("o"). so i have currently identified the following use cases:
some donts:
hope that helped understanding. |
Djot should not be concerned with this and should not need a hardcoded list (of words) to work, but rather allow a syntax that lets a particular backend renderer do its job. just allowing more than just identifier characters in symbol names (eg. adding that would result in the following allowance: but i really need to add that i really like the suggestion of the |
i would like to propose the following language in the specification:
Symbols
Surrounding a word with
To be precise, the allowed characters in a symbols word are |
Same use case for me, I'm also rendering to PDF 👍
I wouldn't necessarily assume a familiarity with "HTML-style" entities -- my "writers" don't really know HTML (or to some extent only). But the initial question goes much farther than just HTML entity name. Font awesome, octicons, material design, emojis... and what's next? Huge custom tables for all of these evolving things? And possibly, what about eventual overlaps? (see below)
Authors don't have to know -- the aliases might be provided in another (definition) file.
Now authors have to know CSS-like styles? TL,DR... But what is a "smiley face"?
The "authors" should therefore perhaps just type |
Somewhat unrelated, but while we are at it: |
but the content author would be familiar with the symbols she needs from the documentation of the implementation
nobody expects you to implement any tables, unless it is for your own use-case.
yes, they will get it from the documentation of the backend, or by their own doing where it is "implementation dependent"
again, they dont need to know, but a particular implementation could leverage existing prior knowledge. why force backend implementors to do another indirection with lookup-tables (which you already disregarded above) because of a limitation of allowed characters? it would be simpler just to allow a larger set of characters to remove that requirement.
while i like for content creators to simply use that said, an implementation could leverage the data that is already present -- like reading glyph-names from Truetype-/Opentype-/Type1-/SVG-Fonts. maybe i am just thinking way too far ahead into the workflow, but follow me through the following example:
|
good point -- me thinks that the lua-based reference implementation of "%w" will only match ascii letters. but for me:
while i personally could live with the ascii limitation, i would not like to enforce it |
Requiring ASCII was a pragmatic decision -- I wanted to make it easy to implement lightweight parsers for djot. Say we allow non-ASCII alphanumerics. Well, then every djot parser needs code that (a) parses UTF-8 byte sequences to code points and (b) determines which code points are alphanumerics. This is actually a decent amount of additional complexity which we can avoid by requiring ASCII here. There isn't anywhere else where djot parsing requires determining character classes of non-ASCII characters. |
You might say: well, any decent language has these built in! Not C. Not Lua. |
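To illustrate why the ASCII restriction keeps parsers light, here is a rough sketch of a symbol scanner that works on raw bytes with nothing but integer comparisons. It assumes the character set from the regex quoted in the issue body (`[a-zA-Z0-9_-]`); the names `is_symbol_byte` and `scan_symbol` are made up for this example and are not taken from any djot implementation.

```python
from typing import Optional


def is_symbol_byte(b: int) -> bool:
    # Pure byte comparisons: no UTF-8 decoding, no Unicode tables.
    return (
        0x30 <= b <= 0x39      # 0-9
        or 0x41 <= b <= 0x5A   # A-Z
        or 0x61 <= b <= 0x7A   # a-z
        or b in (0x5F, 0x2D)   # '_' and '-'
    )


def scan_symbol(data: bytes, start: int) -> Optional[int]:
    """Return the index just past ':name:' beginning at data[start], or None."""
    if start >= len(data) or data[start] != 0x3A:  # must open with ':'
        return None
    i = start + 1
    while i < len(data) and is_symbol_byte(data[i]):
        i += 1
    # Reject an empty name and a missing closing ':'.
    if i == start + 1 or i >= len(data) or data[i] != 0x3A:
        return None
    return i + 1
```

The same loop ports directly to C or Lua. Allowing non-ASCII alphanumerics would mean replacing `is_symbol_byte` with a UTF-8 decoder plus a character-class lookup.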
i know, that is why i did not mention it specifically, and i would like to use a lua-filter within pandoc for this. |
and i am with @jgm ... doing internationalization for internationalization's sake might actually be a bad decision. pragmatic and keep-it-simple-stupid ! while Unicode is the target we should strive for, ASCII is our basis. |
hmm ... while as a quick fallback ... not considering correct character classes. if you are processing symbol markup in 8bit mode only, any byte > 127 could be treated as a word character - that would also satisfy UTF-8. |
Yes, we don't need to worry about character classes if we just want to accept any non-ASCII character as a character in a symbol. But I don't think we do; these include, for example, lots of spacing characters, accents, and all manner of things. |
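A quick sketch of what the 8-bit fallback rule would let through, assuming UTF-8 input and the ASCII word characters plus `_` and `-`. The helper names are hypothetical. Because every byte of a multi-byte UTF-8 sequence is above 127, the rule cannot tell a letter from a no-break space or a combining accent.

```python
def is_word_byte_8bit(b: int) -> bool:
    """ASCII letters, digits, '_' and '-', plus any byte above 127."""
    return b > 127 or chr(b).isalnum() or b in b"_-"


def accepted(name: str) -> bool:
    """True if every UTF-8 byte of the name passes the 8-bit rule."""
    data = name.encode("utf-8")
    return len(data) > 0 and all(is_word_byte_8bit(b) for b in data)


print(accepted("\u00e9"))    # e with acute accent
print(accepted("a\u00a0b"))  # contains U+00A0 NO-BREAK SPACE
print(accepted("a\u0301"))   # contains U+0301 COMBINING ACUTE ACCENT
print(accepted("a b"))       # contains an ASCII space
```

Only the last name is rejected, which is the concern raised above about spacing characters and accents.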
i would like to propose the following updated language in the specification:
Symbols
Surrounding a word with
Notes:
|
Your proposal, if I understand it correctly, makes parsing implementation-dependent. I don't think that's good. There should be one answer to the question, "is this text a symbol?" |
@jgm I’m totally clear and OK as to why symbols are restricted to ASCII. I just think that all printable (i.e. not controls and perhaps not space either) except Yesterday I added the “syntax” I’m thinking that maybe paired brackets Footnotes
|
talking to people bears fruit ... actually i totally disregarded the whitespace case. although i see the use-case, that goes head over heels and way beyond the intention of a "replacement symbol". and while it is possible to simply parse unicode character classes with regex, your needs would require significant complexity. |
then i would like to propose the following updated language in the specification:
Symbols
Surrounding a word with
Notes:
|
Can you explain why we need the second and third clauses? Isn't the first clause unambiguous and sufficient to specify the syntax? Are there some regex implementations that include non-ASCII characters in |
a good example is your usage of |
Technically all Lua character classes are locale dependent, so that if I set for example a Swedish 8-bit locale whatever bytes it uses to encode ‹ÅÄÖåäö› are included in As for Perl in recent versions |
So |
then i would like to propose the following updated language in the specification:
Symbols
Surrounding a word with
Note: To be precise, the allowed characters in a symbols word are |
👍 |
will you close this after you have updated spec and code ? |
currently the definition is as in https://htmlpreview.github.io/?https://github.com/jgm/djot/blob/master/doc/syntax.html#symbols
which says:
so the implementation is highly parser and renderer specific.
so it seems that, at least comparing the Haskell and Lua implementations, there is some disagreement
Proposal
[a-zA-Z0-9\_\-]+
).
Use-Cases
XML/HTML Renderer
HTML Entity
:apos: to be rendered as '
:Euro: to be rendered as &Euro;
HTML Entity Code Point
:#60: to be rendered as < or <
:#x2014: to be rendered as — or — or — or "—"
Icon Font Glyph Name
:fa-bars: to be rendered as <i class="fa fa-bars"></i>
:fa+fa-bars: to be rendered as <i class="fa fa-bars"></i>
This may be subject to the actual implementation and/or configuration of the html-renderer backend.
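The HTML use cases above could be sketched roughly as follows. This is only an illustration of one possible renderer backend, not how any djot implementation behaves: the function name `render_symbol_html` is hypothetical, emitting `&name;` or `&#...;` references is an assumption about the intended output, and the entity lookup relies on Python's `html.entities.html5` table.

```python
import html
import re
from html.entities import html5
from typing import Optional

# '#60' or '#x2014' style code points.
CODEPOINT = re.compile(r"#(?:[0-9]+|x[0-9A-Fa-f]+)")


def render_symbol_html(name: str) -> Optional[str]:
    """Render the text between the colons as HTML, or return None if unknown."""
    if CODEPOINT.fullmatch(name):
        return f"&{name};"              # numeric character reference
    if name + ";" in html5:
        return f"&{name};"              # named entity known to the HTML5 table
    if "+" in name or name.startswith("fa-"):
        # 'fa+fa-bars' lists the classes explicitly; 'fa-bars' implies 'fa'.
        classes = name.split("+") if "+" in name else ["fa", name]
        return '<i class="{}"></i>'.format(html.escape(" ".join(classes), quote=True))
    return None                         # the caller decides on a fallback
```

Returning `None` for unknown names keeps the fallback policy out of the lookup itself.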
Icon or Symbol Font Glyph Name for PDF, Image, or Unicode Text Renderer
:a19: to be rendered as "✓" (from Zapf Dingbats font) -- (U+2713 CHECK MARK ✓, ✓)
Possible Variation
it might be desirable to clearly separate the verbatim html entities from glyph names by using a prefix for indication.
:*a19: to be rendered as "✓" (from Zapf Dingbats font) -- (U+2713 CHECK MARK ✓, ✓)
Possible candidates would be: ^, !, &, $, %, /, =, ?, +, ~, *, #
So a possible syntax could be in perl-style regular expression:
Graceful Fallback Mechanism
where a renderer backend is not able to recognize which symbol or glyph to actually render, it might fall back in the following ways:
pure text style
:apos: to be rendered as :apos:
:Euro: to be rendered as :Euro:
:#60: to be rendered as :#60:
:#x2014: to be rendered as :#x2014:
:fa-bars: to be rendered as :fa-bars:
:fa+fa-bars: to be rendered as :fa+fa-bars:
:a19: to be rendered as :a19:
:*a19: to be rendered as :*a19:
html style
:fa-bars: to be rendered as <code>:fa-bars:</code>
:fa+fa-bars: to be rendered as <code>:fa+fa-bars:</code>
:a19: to be rendered as <code>:a19:</code>
:*a19: to be rendered as <code>:*a19:</code>
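The two fallback styles listed above fit in a few lines. This is a sketch only: the function name `fallback` and the `style` parameter are hypothetical, and escaping the literal for the html style is an added precaution rather than something the proposal spells out.

```python
import html


def fallback(name: str, style: str = "text") -> str:
    """Re-emit an unrecognized symbol verbatim, optionally wrapped in <code>."""
    literal = f":{name}:"
    if style == "html":
        return f"<code>{html.escape(literal)}</code>"
    return literal
```

A backend would call this only after its own lookup has failed to recognize the symbol.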
Entities that should always be recognized
 
, 
, etc.