-
I noticed that

Here's a test case:

```js
import { toMarkdown } from "mdast-util-to-markdown";

const tree = {
  type: "root",
  children: [
    {
      type: "paragraph",
      children: [
        { type: "text", value: "foo" },
        {
          type: "strong",
          children: [{ type: "text", value: " this is bold " }],
        },
        { type: "text", value: "bar" },
      ],
    },
  ],
};

console.log(toMarkdown(tree));
```

Current output:

Expected output:
-
An interesting question @gschlager. The mdast content example you shared is not itself valid markdown.

```js
const tree = {
  type: "root",
  children: [
    {
      type: "paragraph",
      children: [
        { type: "text", value: "foo" },
        {
          type: "strong",
          children: [{ type: "text", value: " this is bold " }],
        },
        { type: "text", value: "bar" },
      ],
    },
  ],
};
```

adds spaces around

I'm not sure I see the stringifier as being the place to put content validation/content fixing.
-
Hey again @gschlager! Sooo, it took me hours and hours of thinking but I managed to reproduce your AST in markdown 😅
->

foo this is bold bar

Pretty doubtful that anyone would ever write that though. But even before I figured it out, I had already become more understanding of the use cases you mention, where folks are working on the ASTs and injecting punctuation/whitespace/whatever in text or adding/removing emphasis nodes.

We already use character references in a couple places. When things can be input:

```markdown
&#x20;this&#x20;
```

->

```html
<p> this </p>
```

…we try and output it too:

```js
import {toMarkdown} from 'mdast-util-to-markdown'

/** @type {import('mdast').Root} */
const tree = {
  type: 'root',
  children: [
    {type: 'paragraph', children: [{type: 'text', value: ' this '}]}
  ]
}

console.log(toMarkdown(tree))
```

->

```markdown
&#x20;this&#x20;
```

That way, it roundtrips.

We can turn anything into a character reference.

```markdown
x * this *
x *&#x20;this&#x20;*
x *.this.*
x *&#x2E;this&#x2E;*
```

->

```html
<p>x * this *</p>
<p>x <em> this </em></p>
<p>x <em>.this.</em></p>
<p>x <em>.this.</em></p>
```

Emphasis can form based on the kind of character before and after the “run”, which can be whitespace, punctuation, or anything else. A bigger example (though still reduced, because left and right runs are the same and only looking at asterisks):

|                         | A (letter inside) | B (punctuation inside) | C (whitespace inside) | D (nothing inside) |
| ----------------------- | ----------------- | ---------------------- | --------------------- | ------------------ |
| 1 (letter outside)      | `x*y*z`           | `x*.*z`                | `x* *z`               | `x**z`             |
| 2 (punctuation outside) | `.*y*.`           | `.*.*.`                | `.* *.`               | `.**.`             |
| 3 (whitespace outside)  | `x *y* z`         | `x *.* z`              | `x * * z`             | `x ** z`           |
| 4 (nothing outside)     | `*x*`             | `*.*`                  | `* *`                 | `**`               |

->
Inspecting that, we can divide them into two groups:

Now given our magic trick: we can turn letters (1 and A) and whitespace (3 and C) into punctuation (2 and B), by turning them into references (which start with
Visualizing that:
We observe some interesting aspects:
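To make the rows and columns of that table concrete, here is a minimal sketch of the flanking rules as I read them in the CommonMark spec, for asterisks only (underscores have extra rules). The names `classify`, `canOpen` and `canClose` are made up for this illustration; this is not code from micromark or `mdast-util-to-markdown`.

```js
// Classify the character next to a delimiter run.
// `undefined` stands for the start or end of the line, which counts as whitespace.
function classify(char) {
  if (char === undefined || /\s/u.test(char)) return 'whitespace'
  if (/[\p{P}\p{S}]/u.test(char)) return 'punctuation'
  return 'other'
}

// Left-flanking: a `*` run here may open emphasis.
function canOpen(before, after) {
  const outside = classify(before)
  const inside = classify(after)
  return inside !== 'whitespace' && (inside !== 'punctuation' || outside !== 'other')
}

// Right-flanking: a `*` run here may close emphasis.
function canClose(before, after) {
  const inside = classify(before)
  const outside = classify(after)
  return inside !== 'whitespace' && (inside !== 'punctuation' || outside !== 'other')
}

console.log(canOpen(' ', ' ')) // literal space inside: cannot open
console.log(canOpen(' ', '&')) // reference inside, whitespace outside: can open
console.log(canOpen('x', '&')) // reference inside, letter outside: cannot open
```

The last call is why encoding only the inner whitespace is not enough when a letter sits directly outside the run.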
In this case though, we only care about list X: we’re looking at an AST

```js
import {toMarkdown} from 'mdast-util-to-markdown'

/** @type {import('mdast').Root} */
const tree = {type: 'root', children: [
  {type: 'paragraph', children: [{type: 'text', value: 'a '},
    {type: 'emphasis', children: []},
    {type: 'text', value: ' b'}]}
]}

console.log(toMarkdown(tree))
```

-> (current)

```markdown
a ** b
```

->

From the interesting aspects above, we found that we can add an encoded whitespace (a normal space? Zero-width space? No-break space?) inside it:

```markdown
punctuation around:
a.*&#x20;*.b (space)
a.*&#x200B;*.b (zwsp)
a.*&#xA0;*.b (nbsp)
whitespace around:
a *&#x20;* b (space)
a *&#x200B;* b (zwsp)
a *&#xA0;* b (nbsp)
```

->

whitespace around:

It’s not perfect, adding that space, as expressed before.

a b?
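For the original `strong` case, the "turn it into a reference" trick could be sketched roughly like this. It is only an illustration under my own assumptions: `serializeStrong`, `encode` and `classify` are hypothetical helpers, empty nodes are simply dropped, and none of this is what `mdast-util-to-markdown` actually does internally.

```js
// Same three-way classification that CommonMark applies around delimiter runs.
function classify(char) {
  if (char === undefined || /\s/u.test(char)) return 'whitespace'
  if (/[\p{P}\p{S}]/u.test(char)) return 'punctuation'
  return 'other'
}

// Hexadecimal numeric character reference, e.g. ' ' becomes '&#x20;'.
function encode(char) {
  return '&#x' + char.codePointAt(0).toString(16).toUpperCase() + ';'
}

// Serialize `before` + strong(`inner`) + `after` so that both `**` runs can flank.
function serializeStrong(before, inner, after) {
  const left = Array.from(before)
  const chars = Array.from(inner)
  const right = Array.from(after)
  // Assumption: an empty strong node is dropped instead of serialized.
  if (chars.length === 0) return before + after
  const last = chars.length - 1
  const openKind = classify(chars[0])
  const closeKind = classify(chars[last])
  // Edge whitespace inside the node becomes a reference, which reads as punctuation.
  if (openKind === 'whitespace') chars[0] = encode(chars[0])
  if (closeKind === 'whitespace' && last > 0) chars[last] = encode(chars[last])
  // Punctuation inside next to a letter outside cannot flank, so encode the letter too.
  if (openKind !== 'other' && left.length > 0 && classify(left[left.length - 1]) === 'other') {
    left[left.length - 1] = encode(left[left.length - 1])
  }
  if (closeKind !== 'other' && right.length > 0 && classify(right[0]) === 'other') {
    right[0] = encode(right[0])
  }
  return left.join('') + '**' + chars.join('') + '**' + right.join('')
}

console.log(serializeStrong('foo', ' this is bold ', 'bar'))
// fo&#x6F;**&#x20;this is bold&#x20;**&#x62;ar
```

The result is ugly, but as far as I can tell from the flanking rules it should parse back to the same tree, whitespace included.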
-
I stumbled upon this thread trying to figure out how to deal with whitespace in inline HTML elements (created in some wysiwyg editor) when converting it to markdown. I'm running this:
... through
Which doesn't work as markdown:
Similarly if

So unlike @gschlager, I didn't get it from parsing markdown but from parsing HTML to hast to mdast to markdown, which still leads to the same invalid output. I'm trying to figure out if I can somehow "sanitize" the whitespace characters by lifting them out of the inline elements to prepare for the markdown transformation, but I haven't figured it out yet.
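The kind of "sanitizing" I have in mind would look roughly like the sketch below. It assumes the simplest case only: one flat list of phrasing children, where each inline node holds a single text child. `liftEdgeWhitespace` is a name I made up, not an existing utility.

```js
// Inline mdast node types whose edge whitespace breaks the markdown output.
const INLINE = new Set(['strong', 'emphasis', 'delete'])

// Return a new children list with leading/trailing whitespace moved out of inline nodes.
function liftEdgeWhitespace(children) {
  const result = []
  for (const node of children) {
    const only = node.children && node.children.length === 1 ? node.children[0] : undefined
    if (!INLINE.has(node.type) || !only || only.type !== 'text') {
      result.push(node)
      continue
    }
    // This pattern matches every string: leading whitespace, core text, trailing whitespace.
    const [, leading, core, trailing] = only.value.match(/^(\s*)([\s\S]*?)(\s*)$/)
    if (leading) result.push({type: 'text', value: leading})
    // A node that held nothing but whitespace is dropped entirely.
    if (core) result.push({...node, children: [{...only, value: core}]})
    if (trailing) result.push({type: 'text', value: trailing})
  }
  return result
}

const paragraphChildren = [
  {type: 'text', value: 'foo'},
  {type: 'strong', children: [{type: 'text', value: ' this is bold '}]},
  {type: 'text', value: 'bar'}
]
console.log(JSON.stringify(liftEdgeWhitespace(paragraphChildren)))
```

Adjacent text nodes are not merged here, and nested inline nodes are not handled, so this only covers the easy case.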
-
I initially tried to build on the solution by @danburzo but found some edge cases in which it also failed. So I wrote my own recursive algorithm that bubbles the space up to the root node and also takes care of a number of other cases where the translation can fail. If this is a valid solution, I would like to contribute this code to the library where the translation is happening. Any help would be welcome.

```js
const cleanUpSpaces = () => {
  return /** @param {import('hast').Nodes | import('mdast').TopLevelContent} htmlTree */ (htmlTree) => {
    /**
     * @param {import('hast').Parent} node
     * @param {import('hast').Parent | null} parent
     * @param {number} index
     * @returns {void}
     */
    const visitNode = (node, parent, index) => {
      // Text nodes do not have a children property, so this returns early for all such nodes.
      if (!node.children) {
        return;
      }
      /**
       * If a parent node such as strong, del or em doesn't have any children (the children array is present,
       * but is an empty array), remove it as it can cause the translation to break.
       * eg: <p><strong></strong></p> -----> <p></p>
       */
      if (node.children.length === 0) {
        parent?.children.splice(index, 1);
        return;
      }
      /**
       * Traversing in reverse order because the visitNode function will change the array in which the node is present,
       * and in the case of leading spaces, the space gets extracted and inserted at the index of the node and the node gets displaced
       * into the next place, so in the case of forward iteration, this node is again processed.
       *
       * Also in the case when one node is deleted from the array, the next node will take its place, and in the case
       * of forward iteration, this sibling node can get omitted.
       */
      for (let i = node.children.length - 1; i >= 0; i--) {
        const child = node.children[i];
        visitNode(child, node, i);
      }
      if (node.type !== 'strong' && node.type !== 'emphasis' && node.type !== 'delete') {
        return;
      }
      // The recursive calls above may have removed every child; drop the now-empty node
      // so that reading node.children[0] below cannot fail.
      if (node.children.length === 0) {
        parent?.children.splice(index, 1);
        return;
      }
      const firstChild = node.children[0];
      const lastChild = node.children[node.children.length - 1];
      if (firstChild.type === 'text') {
        /**
         * Looking for leading spaces:
         * <p><strong><text> Hello</text></strong></p> -----> <p><text> </text><strong><text>Hello</text></strong></p>
         * [\s\S] is used instead of `.` so that text containing line breaks still matches.
         */
        const [, leadingSpaces, textValue] = firstChild.value.match(/^(\s+)([\s\S]*)$/) || [];
        if (leadingSpaces && leadingSpaces.length > 0) {
          firstChild.value = textValue;
          parent?.children.splice(index, 0, {
            type: 'text',
            value: leadingSpaces,
          });
          // index is incremented because the space was inserted in the place of the node, so the index of the node will change.
          index += 1;
        }
      }
      if (lastChild.type === 'text') {
        /**
         * Looking for trailing spaces:
         * <p><strong><text>Hello </text></strong></p> -----> <p><strong><text>Hello</text></strong><text> </text></p>
         */
        const [, textValue, trailingSpaces] = lastChild.value.match(/^([\s\S]*?)(\s+)$/) || [];
        if (trailingSpaces && trailingSpaces.length > 0) {
          lastChild.value = textValue;
          parent?.children.splice(index + 1, 0, {
            type: 'text',
            value: trailingSpaces,
          });
        }
      }
      /**
       * Whenever a strong/del/em node has a single text node as a child and the value of the text node is an empty string,
       * remove the node.
       * eg: <p><strong><text></text></strong></p> ----> <p></p>; This can happen when there was a whitespace in the text
       * node, which was identified as a leading space and was extracted outside.
       */
      if (firstChild === lastChild && firstChild.type === 'text' && firstChild.value === '') {
        parent?.children.splice(index, 1);
      }
      return;
    };
    visitNode(htmlTree, null, 0);
  };
};
```