
List Syntax

Will Duquette edited this page Sep 21, 2019 · 1 revision

A discussion of the string representation of lists from tclUtil.c in the current Tcl code base.

/*
 *	*	STRING REPRESENTATION OF LISTS	*	*	*
 *
 * The next several routines implement the conversions of strings to and from
 * Tcl lists. To understand their operation, the rules of parsing and
 * generating the string representation of lists must be known.  Here we
 * describe them in one place.
 *
 * A list is made up of zero or more elements. Any string is a list if it is
 * made up of alternating substrings of element-separating ASCII whitespace
 * and properly formatted elements.
 *
 * The ASCII characters which can make up the whitespace between list elements
 * are:
 *
 *	\u0009	\t	TAB
 *	\u000A	\n	NEWLINE
 *	\u000B	\v	VERTICAL TAB
 *	\u000C	\f	FORM FEED
 * 	\u000D	\r	CARRIAGE RETURN
 *	\u0020		SPACE
 *
 * NOTE: This differs from other places where Tcl defines a role for
 * "whitespace" in the following ways:
 *
 *	* Unlike command parsing, here NEWLINE is just another whitespace
 *	  character; its role as a command terminator in a script has no
 *	  importance here.
 *
 *	* Unlike command parsing, the BACKSLASH NEWLINE sequence is not
 *	  considered to be a whitespace character.
 *
 *	* Other Unicode whitespace characters (recognized by [string is space]
 *	  or Tcl_UniCharIsSpace()) do not play any role as element separators
 *	  in Tcl lists.
 *
 *	* The NUL byte ought not appear, as it is not in strings properly
 *	  encoded for Tcl, but if it is present, it is not treated as
 *	  separating whitespace, or a string terminator. It is just another
 *	  character in a list element.
 *
 * The interpretation of a formatted substring as a list element follows rules
 * similar to the parsing of the words of a command in a Tcl script. Backslash
 * substitution plays a key role, and is defined exactly as it is in command
 * parsing. The same routine, TclParseBackslash(), is used in both command
 * parsing and list parsing.
 *
 * NOTE: This means that if and when backslash substitution rules ever change
 * for command parsing, the interpretation of strings as lists also changes.
 *
 * Backslash substitution replaces an "escape sequence" of one or more
 * characters starting with
 *		\u005c	\	BACKSLASH
 * with a single character. The one character escape sequence case happens only
 * when BACKSLASH is the last character in the string. In all other cases, the
 * escape sequence is at least two characters long.
 *
 * The formatted substrings are interpreted as element values according to the
 * following cases:
 *
 * * If the first character of a formatted substring is
 *		\u007b	{	OPEN BRACE
 *   then the end of the substring is the matching
 *		\u007d	}	CLOSE BRACE
 *   character, where matching is determined by counting nesting levels, and
 *   not including any brace characters that are contained within a backslash
 *   escape sequence in the nesting count. Having found the matching brace,
 *   all characters between the braces are the string value of the element.
 *   If no matching close brace is found before the end of the string, the
 *   string is not a Tcl list. If the character following the close brace is
 *   not an element separating whitespace character, or the end of the string,
 *   then the string is not a Tcl list.
 *
 *   NOTE: this differs from a brace-quoted word in the parsing of a Tcl
 *   command only in its treatment of the backslash-newline sequence. In a
 *   list element, the literal characters in the backslash-newline sequence
 *   become part of the element value. In a script word, conversion to a
 *   single SPACE character is done.
 *
 *   NOTE: Most list element values can be represented by a formatted
 *   substring using brace quoting. The exceptions are any element value that
 *   includes an unbalanced brace not in a backslash escape sequence, and any
 *   value that ends with a backslash not itself in a backslash escape
 *   sequence.
 *
 * * If the first character of a formatted substring is
 *		\u0022	"	QUOTE
 *   then the end of the substring is the next QUOTE character, not counting
 *   any QUOTE characters that are contained within a backslash escape
 *   sequence. If no next QUOTE is found before the end of the string, the
 *   string is not a Tcl list. If the character following the closing QUOTE is
 *   not an element separating whitespace character, or the end of the string,
 *   then the string is not a Tcl list. Having found the limits of the
 *   substring, the element value is produced by performing backslash
 *   substitution on the character sequence between the open and close QUOTEs.
 *
 *   NOTE: Any element value can be represented by this style of formatting,
 *   given suitable choice of backslash escape sequences.
 *
 * * All other formatted substrings are terminated by the next element
 *   separating whitespace character in the string.  Having found the limits
 *   of the substring, the element value is produced by performing backslash
 *   substitution on it.
 *
 *   NOTE: Any element value can be represented by this style of formatting,
 *   given suitable choice of backslash escape sequences, with one exception.
 *   The empty string cannot be represented as a list element without the use
 *   of either braces or quotes to delimit it.
 *
 * This collection of parsing rules is implemented in the routine
 * FindElement().
 *
 * In order to produce lists that can be parsed by these rules, we need the
 * ability to distinguish characters that are part of a list element value
 * from characters providing syntax that defines the structure of the list.
 * This means that our code that generates lists must at a minimum be able to
 * produce escape sequences for the 10 characters identified above that have
 * significance to a list parser (the six whitespace characters, BACKSLASH,
 * OPEN BRACE, CLOSE BRACE, and QUOTE).
 *
 *	*	*	CANONICAL LISTS	*	*	*	*	*
 *
 * In addition to the basic rules for parsing strings into Tcl lists, there
 * are additional properties to be met by the set of list values that are
 * generated by Tcl.  Such list values are often said to be in "canonical
 * form":
 *
 * * When any canonical list is evaluated as a Tcl script, it is a script of
 *   either zero commands (an empty list) or exactly one command. The command
 *   word is exactly the first element of the list, and each argument word is
 *   exactly one of the following elements of the list. This means that any
 *   characters that have special meaning during script evaluation need
 *   special treatment when canonical lists are produced:
 *
 *	* Whitespace between elements may not include NEWLINE.
 *	* The command terminating character,
 *		\u003b	;	SEMICOLON
 *	  must be BRACEd, QUOTEd, or escaped so that it does not terminate the
 * 	  command prematurely.
 *	* Any of the characters that begin substitutions in scripts,
 *		\u0024	$	DOLLAR
 *		\u005b	[	OPEN BRACKET
 *		\u005c	\	BACKSLASH
 *	  need to be BRACEd or escaped.
 *	* In any list where the first character of the first element is
 *		\u0023	#	HASH
 *	  that HASH character must be BRACEd, QUOTEd, or escaped so that it
 *	  does not convert the command into a comment.
 *	* Any list element that contains the character sequence BACKSLASH
 *	  NEWLINE cannot be formatted with BRACEs. The BACKSLASH character
 *	  must be represented by an escape sequence, and unless QUOTEs are
 *	  used, the NEWLINE must be as well.
 *
 * * It is also guaranteed that one can use a canonical list as a building
 *   block of a larger script within command substitution, as in this example:
 *	set script "puts \[[list $cmd $arg]]"; eval $script
 *   To support this usage, any appearance of the character
 *		\u005d	]	CLOSE BRACKET
 *   in a list element must be BRACEd, QUOTEd, or escaped.
 *
 * * Finally it is guaranteed that enclosing a canonical list in braces
 *   produces a new value that is also a canonical list.  This new list has
 *   length 1, and its only element is the original canonical list.  This same
 *   guarantee also makes it possible to construct scripts where an argument
 *   word is given a list value by enclosing the canonical form of that list
 *   in braces:
 *	set script "puts {[list $one $two $three]}"; eval $script
 *   This sort of coding was once fairly common, though it's become more
 *   idiomatic to see the following instead:
 *	set script [list puts [list $one $two $three]]; eval $script
 *   In order to support this guarantee, every canonical list must be
 *   balanced when counting those braces that are not in escape sequences.
 *
 * Within these constraints, the canonical list generation routines
 * TclScanElement() and TclConvertElement() attempt to generate, for any
 * list, the string that is easiest to read. When an element value is itself
 * acceptable as the formatted substring, it is usually used (CONVERT_NONE).
 * When some quoting or escaping is required, use of BRACEs (CONVERT_BRACE) is
 * usually preferred over the use of escape sequences (CONVERT_ESCAPE). There
 * are some exceptions to both of these preferences for reasons of code
 * simplicity, efficiency, and continuation of historical habits. Canonical
 * lists never use the QUOTE formatting to delimit their elements because that
 * form of quoting does not nest, which makes construction of nested lists far
 * too much trouble.  Canonical lists always use only a single SPACE character
 * for element-separating whitespace.
 *
 *	*	*	FUTURE CONSIDERATIONS	*	*	*
 *
 * When a list element requires quoting or escaping due to a CLOSE BRACKET
 * character or an internal QUOTE character, a strange formatting mode is
 * recommended. For example, if the value "a{b]c}d" is converted by the usual
 * modes:
 *
 *	CONVERT_BRACE:	a{b]c}d		=> {a{b]c}d}
 *	CONVERT_ESCAPE:	a{b]c}d		=> a\{b\]c\}d
 *
 * we get perfectly usable formatted list elements. However, this is not what
 * Tcl releases have been producing. Instead, we have:
 *
 *	CONVERT_MASK:	a{b]c}d		=> a{b\]c}d
 *
 * where the CLOSE BRACKET is escaped, but the BRACEs are not. The same effect
 * can be seen replacing ] with " in this example. There does not appear to be
 * any functional or aesthetic purpose for this strange additional mode. The
 * sole purpose I can see for preserving it is to keep generating the same
 * formatted lists programmers have become accustomed to, and perhaps written
 * tests to expect. That is, compatibility only. The additional code
 * complexity required to support this mode is significant. The lines of code
 * supporting it are delimited in the routines below with #if COMPAT
 * directives. This makes it easy to experiment with eliminating this
 * formatting mode simply with "#define COMPAT 0" above. I believe this is
 * worth considering.
 *
 * Another consideration is the treatment of QUOTE characters in list
 * elements. TclConvertElement() must have the ability to produce the escape
 * sequence \" so that when a list element begins with a QUOTE we do not
 * confuse that first character with a QUOTE used as list syntax to define
 * list structure. However, that is the only place where QUOTE characters need
 * quoting. In this way, handling QUOTE could really be much more like the way
 * we handle HASH which also needs quoting and escaping only in particular
 * situations. Following up on this could increase the set of list elements
 * that can use the CONVERT_NONE formatting mode.
 *
 * More speculative is that the demands of canonical list form require brace
 * balance for the list as a whole, while the current implementation achieves
 * this by establishing brace balance for every element.
 *
 * Finally, a reminder that the rules for parsing and formatting lists are
 * closely tied together with the rules for parsing and evaluating scripts,
 * and will need to evolve in sync.
 */
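
The parsing rules in the comment above can be sketched in a few lines of C. This is a minimal illustration and not the code in tclUtil.c: the names `is_list_space()` and `count_list_elements()` are hypothetical, and the sketch only locates element boundaries without performing backslash substitution. It also assumes every escape sequence is a BACKSLASH plus one following character, which is enough to keep escaped braces, quotes, and whitespace out of the structural scan, but it ignores the special handling of the backslash-newline sequence that TclParseBackslash() provides.

```c
#include <stddef.h>

/* Returns 1 for the six ASCII element separators listed in the comment. */
static int is_list_space(char c)
{
    return c == '\t' || c == '\n' || c == '\v' ||
           c == '\f' || c == '\r' || c == ' ';
}

/*
 * Hypothetical helper: counts the elements of the list held in s[0..len).
 * Returns -1 if the string is not a list under the rules described above.
 * The length is explicit, so a NUL byte is just another element character.
 * Assumption: an escape sequence is a backslash plus one character.
 */
static int count_list_elements(const char *s, size_t len)
{
    size_t i = 0;
    int count = 0;

    for (;;) {
        /* Skip element-separating whitespace. */
        while (i < len && is_list_space(s[i])) {
            i++;
        }
        if (i == len) {
            return count;
        }

        if (s[i] == '{') {
            /* Brace-quoted element: find the matching close brace. */
            int depth = 1;

            i++;
            while (i < len && depth > 0) {
                if (s[i] == '\\') {
                    /* Escaped character does not affect nesting. */
                    i++;
                    if (i < len) {
                        i++;
                    }
                } else {
                    if (s[i] == '{') {
                        depth++;
                    } else if (s[i] == '}') {
                        depth--;
                    }
                    i++;
                }
            }
            if (depth > 0) {
                return -1;  /* no matching close brace */
            }
            if (i < len && !is_list_space(s[i])) {
                return -1;  /* junk after the close brace */
            }
        } else if (s[i] == '"') {
            /* Quoted element: find the next unescaped QUOTE. */
            i++;
            while (i < len && s[i] != '"') {
                if (s[i] == '\\' && i + 1 < len) {
                    i++;
                }
                i++;
            }
            if (i == len) {
                return -1;  /* no closing quote */
            }
            i++;            /* step past the closing quote */
            if (i < len && !is_list_space(s[i])) {
                return -1;  /* junk after the closing quote */
            }
        } else {
            /* Bare element: runs to the next unescaped separator. */
            while (i < len && !is_list_space(s[i])) {
                if (s[i] == '\\' && i + 1 < len) {
                    i++;
                }
                i++;
            }
        }
        count++;
    }
}
```

Under these assumptions, the string `a {b c} d` counts as three elements, while `{a}b` and `{a b` are rejected because the close brace is followed by a non-separator or is missing altogether. Handling backslash-newline faithfully would require measuring each escape sequence the way TclParseBackslash() does, rather than assuming a fixed length of two.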