In section "6.2 Entities" of the CommonMark spec the issue of HTML (or XHTML, or XML, or SGML) "entities" is discussed, apparently meaning strings like &#45;, &nbsp;, &amp; that appear frequently in HTML (or ...) documents.
I can guess what is meant, but it is said in a rather fuzzy way:
The specification unfortunately talks in terms of "HTML Entities" and, worse, about storing Unicode characters in "the AST", or the kind of output that "renderers" receive: all of this has no place in a CommonMark specification IMO. The current wording has multiple flaws, independent of the intended meaning. The wording is:
too HTML-centric (numeric character references are at the core of HTML and XML, namely inherited from SGML);
wrong in the sense that e.g. "&amp;" is not an entity, but an entity reference,
misleadingly wrong in the case of "&#45;", which has nothing to do with entities or entity references, but is a (numeric) character reference,
too implementation-centric: what is relevant (for CommonMark authors and implementors) is whether the CommonMark parser "sees" the replaced characters or the literal character references. It turns out to be the former, but this has nothing to do with "storing Unicode characters in the AST", or with what kind of output a "renderer receives".
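To make the terminology in the list above concrete, here is a minimal Python sketch. The `classify` helper and its regular expression are my own illustration (they are not part of the spec or of any CommonMark implementation); `html.unescape` from the standard library is used only to show which character each reference stands for.

```python
import html
import re

# Hypothetical pattern, for illustration only: it separates the two kinds of
# reference purely by their syntax.
_REFERENCE = re.compile(
    r"&(?:#(?P<dec>[0-9]+)|#[xX](?P<hex>[0-9a-fA-F]+)|(?P<name>[A-Za-z][A-Za-z0-9]*));"
)


def classify(ref: str) -> str:
    """Say whether `ref` is an entity reference or a numeric character reference."""
    match = _REFERENCE.fullmatch(ref)
    if match is None:
        return "not a reference"
    if match.group("name") is not None:
        # A name such as "amp" refers to a declared entity.
        return "entity reference"
    # "&#45;" and "&#x2D;" name a code point directly; no entity is involved.
    return "numeric character reference"


for ref in ("&amp;", "&nbsp;", "&#45;", "&#x2D;"):
    print(f"{ref!r}: {classify(ref)} -> {html.unescape(ref)!r}")
```

In both cases the parser ends up "seeing" the replaced character; the difference is only in how the reference names that character.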
I think that the CommonMark specification should talk only about the syntactic conventions used in CommonMark as far as possible, and refrain from prescribing implementation details.
Particularly problematic is the use of the acronym "AST". The spec gives:
neither the full term "Abstract Syntax Tree" (I assume this is meant!),
nor any definition of this computer science jargon term,
while not every CommonMark processor will even build an AST in the first place.
A reader of the spec with no CS background will probably feel kind of helpless …
I think the term "AST" is used here to allude to an abstract concept, namely that a CommonMark processor detects and uses "the structure" inscribed in the input text in some way, and the spec is meant to define this way (as a kind of operational semantics).
"The structure" is again a rather elusive term, but in the CommonMark case it coincides exactly with (a part of) the "structure and content" of an HTML (or XML, or XHTML, or SGML) document: what is meant is the Information Set of the "target document" which the processor has to discover, use, and output in some form or another.
For the simple case of CommonMark "structured documents", it consists of:
a nested structure of "elements" (having "types" and "attributes"),
and (Unicode) character content inside (some of) these "elements".
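The two ingredients listed above can be written down in a few lines. This is only a sketch under my own assumptions: the class name `Element`, its fields, and the type names "document", "heading", "paragraph" and "emph" are chosen for illustration and are not taken from the spec.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Union


@dataclass
class Element:
    """One node of the nested structure: a type, attributes, and content."""

    type: str
    attributes: dict[str, str] = field(default_factory=dict)
    # Content is a mix of nested elements and (Unicode) character data.
    children: list[Union["Element", str]] = field(default_factory=list)


def text_content(node: Union[Element, str]) -> str:
    """Concatenate all character content below `node`, in document order."""
    if isinstance(node, str):
        return node
    return "".join(text_content(child) for child in node.children)


# Illustrative structure for a heading followed by a paragraph. The heading
# holds the already replaced character "&", not the reference "&amp;".
doc = Element(
    "document",
    children=[
        Element("heading", {"level": "1"}, ["Tom & Jerry"]),
        Element(
            "paragraph",
            children=["A short ", Element("emph", children=["example"]), "."],
        ),
    ],
)

print(text_content(doc))
```

Nothing here requires a processor to build such a tree; it is just one concrete way to talk about "elements" and their character content.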
This could probably be explained in a reasonably simple way by referring to some kind of concrete representation of the structure "discovered" in the CommonMark text, i.e. by answering questions like:
where do the nested elements (e.g. a heading, a paragraph, a list item) start and end?
how are chunks of the input text used as character content for these nested elements (that's where the character references come in again)?
Answers could be "demonstrated" maybe by presenting the corresponding CommonMark "native" XML fragments (to alleviate a bit the heavy HTML bias throughout the spec), or in some form of (tree-like?) diagrams, and so on.
But it would sure be helpful to have some table or list of "element types", together with their "attributes" and other properties, to get an overview of what kind of "structured document" we're talking about.
And because the CommonMark DTD is (supposed to be) tailored to match the abstract structure of the content of a CommonMark text, it should be a better reference for explaining that structure than HTML (which has a lot of other "elements", "attributes", and peculiarities).
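As a rough idea of what such a "native" XML fragment could look like, the following sketch builds one with Python's standard `xml.etree.ElementTree`. The element names `document`, `heading`, `paragraph` and `text` and the `level` attribute are my assumptions about what the CommonMark DTD provides; check them against the DTD before relying on them. The assumed input is a heading written as "# Tom &amp; Jerry".

```python
import xml.etree.ElementTree as ET

# Element and attribute names below are assumptions, not quoted from the DTD.
document = ET.Element("document")

heading = ET.SubElement(document, "heading", {"level": "1"})
# The character content holds the replaced character "&", which is what the
# parser "sees"; the reference in the input text is gone at this point.
ET.SubElement(heading, "text").text = "Tom & Jerry"

paragraph = ET.SubElement(document, "paragraph")
ET.SubElement(paragraph, "text").text = "A short example."

# Serializing escapes "&" again, but that is a property of XML syntax,
# not of the structure being described.
xml_fragment = ET.tostring(document, encoding="unicode")
print(xml_fragment)
```

Reading the fragment back with `ET.fromstring` yields the same character content, which is the point: the structure and content are independent of how references were spelled in the input.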