List Syntax
Will Duquette edited this page Sep 21, 2019
The following comment from tclUtil.c in the current Tcl code base discusses the string representation of lists.
/*
* * STRING REPRESENTATION OF LISTS * * *
*
* The next several routines implement the conversions of strings to and from
* Tcl lists. To understand their operation, the rules of parsing and
* generating the string representation of lists must be known. Here we
* describe them in one place.
*
* A list is made up of zero or more elements. Any string is a list if it is
* made up of alternating substrings of element-separating ASCII whitespace
* and properly formatted elements.
*
* The ASCII characters which can make up the whitespace between list elements
* are:
*
* \u0009 \t TAB
* \u000A \n NEWLINE
* \u000B \v VERTICAL TAB
* \u000C \f FORM FEED
* \u000D \r CARRIAGE RETURN
* \u0020 SPACE
*
* NOTE: the following differences exist between this and other places where
* Tcl defines a role for "whitespace":
*
* * Unlike command parsing, here NEWLINE is just another whitespace
* character; its role as a command terminator in a script has no
* importance here.
*
* * Unlike command parsing, the BACKSLASH NEWLINE sequence is not
* considered to be a whitespace character.
*
* * Other Unicode whitespace characters (recognized by [string is space]
* or Tcl_UniCharIsSpace()) do not play any role as element separators
* in Tcl lists.
*
* * The NUL byte ought not appear, as it is not in strings properly
* encoded for Tcl, but if it is present, it is not treated as
* separating whitespace, or a string terminator. It is just another
* character in a list element.
*
* The interpretation of a formatted substring as a list element follows rules
* similar to the parsing of the words of a command in a Tcl script. Backslash
* substitution plays a key role, and is defined exactly as it is in command
* parsing. The same routine, TclParseBackslash(), is used in both command
* parsing and list parsing.
*
* NOTE: This means that if and when backslash substitution rules ever change
* for command parsing, the interpretation of strings as lists also changes.
*
* Backslash substitution replaces an "escape sequence" of one or more
* characters starting with
* \u005c \ BACKSLASH
* with a single character. The one-character escape sequence case happens
* only when BACKSLASH is the last character in the string. In all other
* cases, the escape sequence is at least two characters long.
*
* The formatted substrings are interpreted as element values according to the
* following cases:
*
* * If the first character of a formatted substring is
* \u007b { OPEN BRACE
* then the end of the substring is the matching
* \u007d } CLOSE BRACE
* character, where matching is determined by counting nesting levels, and
* not including any brace characters that are contained within a backslash
* escape sequence in the nesting count. Having found the matching brace,
* all characters between the braces are the string value of the element.
* If no matching close brace is found before the end of the string, the
* string is not a Tcl list. If the character following the close brace is
* not an element separating whitespace character, or the end of the string,
* then the string is not a Tcl list.
*
* NOTE: this differs from a brace-quoted word in the parsing of a Tcl
* command only in its treatment of the backslash-newline sequence. In a
* list element, the literal characters in the backslash-newline sequence
* become part of the element value. In a script word, conversion to a
* single SPACE character is done.
*
* NOTE: Most list element values can be represented by a formatted
* substring using brace quoting. The exceptions are any element value that
* includes an unbalanced brace not in a backslash escape sequence, and any
* value that ends with a backslash not itself in a backslash escape
* sequence.
*
* * If the first character of a formatted substring is
* \u0022 " QUOTE
* then the end of the substring is the next QUOTE character, not counting
* any QUOTE characters that are contained within a backslash escape
* sequence. If no next QUOTE is found before the end of the string, the
* string is not a Tcl list. If the character following the closing QUOTE is
* not an element separating whitespace character, or the end of the string,
* then the string is not a Tcl list. Having found the limits of the
* substring, the element value is produced by performing backslash
* substitution on the character sequence between the open and close QUOTEs.
*
* NOTE: Any element value can be represented by this style of formatting,
* given suitable choice of backslash escape sequences.
*
* * All other formatted substrings are terminated by the next element
* separating whitespace character in the string. Having found the limits
* of the substring, the element value is produced by performing backslash
* substitution on it.
*
* NOTE: Any element value can be represented by this style of formatting,
* given suitable choice of backslash escape sequences, with one exception.
* The empty string cannot be represented as a list element without the use
* of either braces or quotes to delimit it.
*
* This collection of parsing rules is implemented in the routine
* FindElement().
*
* In order to produce lists that can be parsed by these rules, we need the
* ability to distinguish characters that are part of a list element value
* from characters providing syntax that defines the structure of the list.
* This means that our code that generates lists must at a minimum be able to
* produce escape sequences for the 10 characters identified above that have
* significance to a list parser.
*
* * * CANONICAL LISTS * * * * *
*
* In addition to the basic rules for parsing strings into Tcl lists, there
* are additional properties to be met by the set of list values that are
* generated by Tcl. Such list values are often said to be in "canonical
* form":
*
* * When any canonical list is evaluated as a Tcl script, it is a script of
* either zero commands (an empty list) or exactly one command. The command
* word is exactly the first element of the list, and each argument word is
* exactly one of the following elements of the list. This means that any
* characters that have special meaning during script evaluation need
* special treatment when canonical lists are produced:
*
* * Whitespace between elements may not include NEWLINE.
* * The command terminating character,
* \u003b ; SEMICOLON
* must be BRACEd, QUOTEd, or escaped so that it does not terminate the
* command prematurely.
* * Any of the characters that begin substitutions in scripts,
* \u0024 $ DOLLAR
* \u005b [ OPEN BRACKET
* \u005c \ BACKSLASH
* need to be BRACEd or escaped.
* * In any list where the first character of the first element is
* \u0023 # HASH
* that HASH character must be BRACEd, QUOTEd, or escaped so that it
* does not convert the command into a comment.
* * Any list element that contains the character sequence BACKSLASH
* NEWLINE cannot be formatted with BRACEs. The BACKSLASH character
* must be represented by an escape sequence, and unless QUOTEs are
* used, the NEWLINE must be as well.
*
* * It is also guaranteed that one can use a canonical list as a building
* block of a larger script within command substitution, as in this example:
*     set script "puts \[[list $cmd $arg]]"; eval $script
* To support this usage, any appearance of the character
* \u005d ] CLOSE BRACKET
* in a list element must be BRACEd, QUOTEd, or escaped.
*
* * Finally it is guaranteed that enclosing a canonical list in braces
* produces a new value that is also a canonical list. This new list has
* length 1, and its only element is the original canonical list. This same
* guarantee also makes it possible to construct scripts where an argument
* word is given a list value by enclosing the canonical form of that list
* in braces:
*     set script "puts {[list $one $two $three]}"; eval $script
* This sort of coding was once fairly common, though it's become more
* idiomatic to see the following instead:
*     set script [list puts [list $one $two $three]]; eval $script
* In order to support this guarantee, every canonical list must have
* balance when counting those braces that are not in escape sequences.
*
* Within these constraints, the canonical list generation routines
* TclScanElement() and TclConvertElement() attempt to generate the string for
* any list that is easiest to read. When an element value is itself
* acceptable as the formatted substring, it is usually used (CONVERT_NONE).
* When some quoting or escaping is required, use of BRACEs (CONVERT_BRACE) is
* usually preferred over the use of escape sequences (CONVERT_ESCAPE). There
* are some exceptions to both of these preferences for reasons of code
* simplicity, efficiency, and continuation of historical habits. Canonical
* lists never use the QUOTE formatting to delimit their elements because that
* form of quoting does not nest, which makes construction of nested lists far
* too much trouble. Canonical lists always use only a single SPACE character
* for element-separating whitespace.
*
* * * FUTURE CONSIDERATIONS * * *
*
* When a list element requires quoting or escaping due to a CLOSE BRACKET
* character or an internal QUOTE character, a strange formatting mode is
* recommended. For example, if the value "a{b]c}d" is converted by the usual
* modes:
*
* CONVERT_BRACE: a{b]c}d => {a{b]c}d}
* CONVERT_ESCAPE: a{b]c}d => a\{b\]c\}d
*
* we get perfectly usable formatted list elements. However, this is not what
* Tcl releases have been producing. Instead, we have:
*
* CONVERT_MASK: a{b]c}d => a{b\]c}d
*
* where the CLOSE BRACKET is escaped, but the BRACEs are not. The same effect
* can be seen replacing ] with " in this example. There does not appear to be
* any functional or aesthetic purpose for this strange additional mode. The
* sole purpose I can see for preserving it is to keep generating the same
* formatted lists programmers have become accustomed to, and perhaps written
* tests to expect. That is, compatibility only. The additional code
* complexity required to support this mode is significant. The lines of code
* supporting it are delimited in the routines below with #if COMPAT
* directives. This makes it easy to experiment with eliminating this
* formatting mode simply with "#define COMPAT 0" above. I believe this is
* worth considering.
*
* Another consideration is the treatment of QUOTE characters in list
* elements. TclConvertElement() must have the ability to produce the escape
* sequence \" so that when a list element begins with a QUOTE we do not
* confuse that first character with a QUOTE used as list syntax to define
* list structure. However, that is the only place where QUOTE characters need
* quoting. In this way, handling QUOTE could really be much more like the way
* we handle HASH which also needs quoting and escaping only in particular
* situations. Following up this could increase the set of list elements that
* can use the CONVERT_NONE formatting mode.
*
* More speculative is that the demands of canonical list form require brace
* balance for the list as a whole, while the current implementation achieves
* this by establishing brace balance for every element.
*
* Finally, a reminder that the rules for parsing and formatting lists are
* closely tied together with the rules for parsing and evaluating scripts,
* and will need to evolve in sync.
*/
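
The parsing rules described in the comment can be sketched in a few lines of C. This is a simplified illustration, not the actual FindElement() from tclUtil.c. The names is_list_space(), ends_element(), sketch_find_element(), and sketch_list_length() are hypothetical. The sketch assumes NUL-terminated strings (so an embedded NUL cannot appear inside an element), treats every escape sequence as a backslash plus one character instead of calling TclParseBackslash(), and reports only the raw bounds of each element without performing backslash substitution.

```c
#include <assert.h>
#include <stddef.h>

// The six ASCII characters that separate list elements.
static int is_list_space(char c) {
    return c == '\t' || c == '\n' || c == '\v' ||
           c == '\f' || c == '\r' || c == ' ';
}

// A closing brace or quote must be followed by whitespace or end of string.
static int ends_element(char c) {
    return c == '\0' || is_list_space(c);
}

// Find the next element in p. Returns 1 and fills *start, *len (raw element
// characters, delimiters stripped) and *next (where to resume scanning);
// returns 0 if only whitespace remains; returns -1 if p is not a list.
static int sketch_find_element(const char *p, const char **start,
                               size_t *len, const char **next) {
    const char *q;
    assert(p != NULL && start != NULL && len != NULL && next != NULL);

    while (is_list_space(*p)) p++;
    if (*p == '\0') return 0;

    if (*p == '{') {
        // Brace-quoted: count nesting, ignoring backslash-escaped braces.
        int depth = 1;
        for (q = p + 1; *q != '\0'; q++) {
            if (*q == '\\' && q[1] != '\0') q++;
            else if (*q == '{') depth++;
            else if (*q == '}' && --depth == 0) break;
        }
        if (*q == '\0' || !ends_element(q[1])) return -1;
        *start = p + 1; *len = (size_t)(q - p - 1); *next = q + 1;
        return 1;
    }
    if (*p == '"') {
        // Quote-delimited: ends at the next unescaped QUOTE.
        for (q = p + 1; *q != '\0' && *q != '"'; q++) {
            if (*q == '\\' && q[1] != '\0') q++;
        }
        if (*q == '\0' || !ends_element(q[1])) return -1;
        *start = p + 1; *len = (size_t)(q - p - 1); *next = q + 1;
        return 1;
    }
    // Bare word: ends at the next unescaped separator.
    for (q = p; *q != '\0' && !is_list_space(*q); q++) {
        if (*q == '\\' && q[1] != '\0') q++;
    }
    *start = p; *len = (size_t)(q - p); *next = q;
    return 1;
}

// Count the elements of s, or return -1 if s is not a list under the
// simplified rules above.
static int sketch_list_length(const char *s) {
    const char *start, *next = s;
    size_t len;
    int n = 0, rc;
    while ((rc = sketch_find_element(next, &start, &len, &next)) == 1) n++;
    return rc < 0 ? -1 : n;
}
```

Under these assumptions, sketch_list_length("a {b c} d") returns 3, while the string "{a}b" is rejected because the close brace is followed by neither whitespace nor the end of the string.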