Implementation bias in CommonMark spec (wrt &#45; &#xA0; &amp; and so on) #375

tin-pot · 2015-11-05T01:00:08Z

In section "6.2 Entities" of the CommonMark spec the issue of HTML (or XHTML, or XML, or SGML) "entities" is discussed, apparently meaning strings like &#45;, &#xA0;, &amp; that appear frequently in HTML (or ...) documents.

I can guess what is meant, but it is said in a rather fuzzy way:

The specification unfortunately talks in terms of "HTML Entities" and, worse, about storing Unicode characters in "the AST", or the kind of output that "renderers" receive: all of this has no place in a CommonMark specification IMO. The current wording has multiple flaws (independent of the intended meaning); the wording is:

too HTML-centric (numeric character references are at the core of HTML and XML, namely inherited from SGML);
wrong in the sense that eg "&amp;" is not an entity, but an entity reference;
misleadingly wrong in the case of "&#45;" which has nothing to do with entities or entity references, but is a (numeric) character reference;
too implementation-centric: what is relevant (for CommonMark authors and implementors) is whether the CommonMark parser "sees" the replaced characters or the literal character references: it turns out the former, but this has nothing to do with "storing Unicode characters in the AST", or what kind of output a "renderer receives".
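
To make the terminology in the list above concrete, here is a minimal Python sketch that tells the three syntactic forms apart and shows the characters they are replaced by. The pattern `REFERENCE` and the function `classify` are hypothetical names made up for this illustration; they are not taken from the spec or from any CommonMark implementation, and the pattern does not check entity names or code point ranges. Only `html.unescape` is a real standard-library call.

```python
import html
import re

# Hypothetical pattern, for illustration only: it separates the three
# syntactic forms but validates neither entity names nor code points.
REFERENCE = re.compile(
    r"&(?:#[xX](?P<hex>[0-9A-Fa-f]+)"
    r"|#(?P<dec>[0-9]+)"
    r"|(?P<name>[A-Za-z][A-Za-z0-9]*));"
)


def classify(ref: str) -> str:
    """Name the kind of reference `ref` is, or report that it is none."""
    match = REFERENCE.fullmatch(ref)
    if match is None:
        return "not a reference"
    if match.group("hex") is not None:
        return "hexadecimal numeric character reference"
    if match.group("dec") is not None:
        return "decimal numeric character reference"
    return "entity reference"


for ref in ["&#45;", "&#xA0;", "&amp;"]:
    # html.unescape yields the replaced character, i.e. what the
    # parser "sees" instead of the literal reference.
    print(ref, "is a", classify(ref), "for", repr(html.unescape(ref)))
```

None of the three strings is itself an "entity": the first two are numeric character references, and only the third is an entity reference.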

I think that the CommonMark specification should talk only about the syntactic conventions used in CommonMark as far as possible, and refrain from prescribing implementation details.

Particularly problematic is the use of the acronym "AST":

Neither the full term "Abstract Syntax Tree" (I assume this is meant!),
nor any definition of this computer science jargon term,
nor even a link to Wikipedia is given.
And not every CommonMark processor will even build an AST in the first place.

A reader of the spec with no CS background will probably feel kind of helpless …

I think the term "AST" is used here to allude to an abstract concept, namely that a CommonMark processor detects and uses "the structure" inscribed in the input text in some way, and the spec is meant to define this way (a kind of operational semantics).

"The structure" is again a rather elusive term, but in the CommonMark case it coincides exactly with (a part of) the "structure and content" of an HTML (or XML, or XHTML, or SGML) document: what is meant is the Information Set of the "target document" which the processor has to discover, use, and output in some form or another.

For the simple case of CommonMark "structured documents", it consists of:

a nested structure of "elements" (having "types" and "attributes"),
and (Unicode) character content inside (some of) these "elements".
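
As a rough illustration of these two ingredients, the following Python sketch models an element as a type, a set of attributes, and a sequence of children that are either nested elements or character content. The class `Element`, the helper `text_content`, and the element type names used in the example are assumptions chosen for this sketch, not definitions from the spec.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Union


@dataclass
class Element:
    """Hypothetical model of one element in a structured document."""
    type: str                                    # e.g. "heading"
    attributes: dict[str, str] = field(default_factory=dict)
    # Children are nested elements or (Unicode) character content.
    children: list[Union[Element, str]] = field(default_factory=list)


def text_content(node: Union[Element, str]) -> str:
    """Concatenate all character content below `node`, in document order."""
    if isinstance(node, str):
        return node
    return "".join(text_content(child) for child in node.children)


# Character content holds the replaced characters, not the references:
# "A &amp; B" in the input text would become "A & B" here.
doc = Element("document", children=[
    Element("heading", {"level": "1"}, ["A & B"]),
    Element("paragraph", children=["no\u00a0break"]),
])

print(repr(text_content(doc)))
```

Nothing here requires a processor to actually build such a tree; it only names the information that has to be discovered in the input text.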

This could probably be explained in a reasonably simple way by referring to some kind of concrete representation of the structure "discovered" in the CommonMark text, ie by answering questions like:

where do the nested elements (ie a heading, a paragraph, a list item) start and end?
how are chunks of the input text used as character content for these nested elements (that's where the character references come in again)?

Answers could be "demonstrated" maybe by presenting the corresponding CommonMark "native" XML fragments (to alleviate a bit the heavy HTML bias throughout the spec), or in some form of (tree-like?) diagrams, and so on.
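
One way such an XML fragment could be produced is sketched below with Python's standard `xml.etree.ElementTree` module. The element names `document`, `heading`, `paragraph` and `text`, and the `level` attribute, are assumptions for the sake of the example and have not been checked against the CommonMark DTD.

```python
import xml.etree.ElementTree as ET

# Build a small structured document by hand. The element and attribute
# names are illustrative assumptions, not taken from the CommonMark DTD.
document = ET.Element("document")
heading = ET.SubElement(document, "heading", {"level": "1"})
ET.SubElement(heading, "text").text = "Dash - and ampersand &"
paragraph = ET.SubElement(document, "paragraph")
ET.SubElement(paragraph, "text").text = "Plain text."

# Serializing makes the element boundaries explicit; the serializer
# escapes "&" in character content as the entity reference "&amp;".
xml_fragment = ET.tostring(document, encoding="unicode")
print(xml_fragment)
```

The start and end tags in the printed fragment answer the first question (where elements begin and end), and the text between them answers the second (which characters end up as content).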

But it would sure be helpful to have some table or list of "element types", together with their "attributes" and other properties, to get an overview of what kind of "structured document" we're talking about.

And because the CommonMark DTD is (supposed to be) tailored to match the abstract structure of the content of a CommonMark text, it should be a better reference for explaining that structure than HTML (which has a lot of other "elements", "attributes", and peculiarities).

jgm · 2015-12-28T07:41:42Z

Action items:

Remove or rephrase the two references in the spec to the "AST"
Fix terminology: &amp; is an entity reference, not an entity; &#45; is a numeric character reference.

jgm closed this as completed in d336cfb Dec 29, 2015
