Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTML ➡️ Markdown: Web/JavaScript #5193

Closed
wants to merge 1 commit into from
Closed

HTML ➡️ Markdown: Web/JavaScript #5193

wants to merge 1 commit into from

Conversation

Gregoor
Copy link
Contributor

@Gregoor Gregoor commented May 21, 2021

This is still a work in progress, but I wanted to give people the opportunity to look at it early.

If you want to preview it locally all you have to do is check-out this branch and run

yarn start

That should be it!


If you want to reproduce this PR:

First you have to set-up yari. It may be advisable to point your Yari at a different clone of mdn/content from the one you use for writing work.

Then you can run:

cd wherever-you-have-stored-yari/
yarn
yarn md h2m web/javascript --mode=replace

Then you will see the converted .md files in your mdn/content tree.

(Note that --mode=replace tells the tool to replace .html files with the .md files. Otherwise it just adds the .md files. You probably want replace, because it is closer to the final state and may show up extra problems that could be hidden by the presence of the old .html files.)

To see the rendered pages, run the following to get the preview server running on localhost:3000 (it takes some time to start up):

yarn dev

Conversion Report

The partner PR in yari, which adds markdown conversion rule and overall support for it is here:
mdn/yari#3843

I will try to sync this PR with both converter and content changes every ~other work day.

@Gregoor
Copy link
Contributor Author

Gregoor commented Jun 4, 2021

This branch is now deployed here:
https://h2m-js.content.dev.mdn.mozit.cloud

@@ -29,7 +29,7 @@ includes(searchElement, fromIndex)
### Parameters

* `searchElement`
* : The value to search for.
* : The value to search for. Hi Will!
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Gregor :)

comfortable with HTML and CSS. You may have to start small, and progress
gradually. To begin, let's examine how to add JavaScript to your page for
creating a *Hello world!* example. (*Hello world!*
is[ the standard for introductory programming examples](https://en.wikipedia.org/wiki/%22Hello,\_World!%22\_program).)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This link works but looks wrong: the standard for introductory programming examples

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had not realized this PR also includes files under /en-us/learn/. These should be excluded (the scope of this is /en-us/web/javascript)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be a faithful conversion of the broken original, isn't it?

<p>However, getting comfortable with JavaScript is more challenging than getting comfortable with HTML and CSS. You may have to start small, and progress gradually. To begin, let's examine how to add JavaScript to your page for creating a <em>Hello world!</em> example. (<em>Hello world!</em> is<a href="https://en.wikipedia.org/wiki/%22Hello,_World!%22_program"> the standard for introductory programming examples</a>.)</p>

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, I mistook the comma in the URL as "syntax". It would be good if this was clever enough to note the leading space in the link text and put it before the link. Otherwise, please ignore - sorry

creating a *Hello world!* example. (*Hello world!*
is[ the standard for introductory programming examples](https://en.wikipedia.org/wiki/%22Hello,\_World!%22\_program).)

<div class="warning">
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning not converted properly. I guess you guys trusted "notecard warning".

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this was in "en-us/learn" I didn't prep the content for it, but we should not see this un en-us/web/javascript .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, I have yet to fix FOLDERSEARCH to be less fuzzy. Like Will said, /learn is an accidental inclusion!


<img alt="" src="hello-world.png" style="display: block; margin: 0px auto" />

<div class="note">
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note not converted properly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one has an extra <p> wrapping its children, we've aimed to automatically cover the bulk of notecard cases without adding too many extra rules for outliers which can be fixed by hand

alternate the display of one of two images. This change will happen as a user
clicks the displayed image.

1. Choose an image you want to feature on your example site. Ideally, the image
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arguably these should be auto-numbered. I.e. in GFM you can number all of these as 1. and they render properly. Also you don't need the spaces between lines.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting idea! Me, personally, not a tech writer per-se, I like both, the 1-numbered for making it easy to add items in between and the correctly-numbered for matching the output. I'd defer this decision to all of you and Will.
Unfortunately Prettier, which we use to format the markdown after the conversion, does not have a configuration for lists. It does have interesting behavior though in that it makes lists consistent, check out this example and see how the output changes when you remove the first element.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wbamberg This is a question for us. I have a sneaking preference for auto-numbered lists, and that does generally reflect how the original HTML works. Make authoring easier.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had assumed "correct" numbering, like:

1 one
2 two
3 three

...just because it makes the source files easier to read, and reading happens more than writing. But I can see the other side too.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the reading happens on the rendered output mostly :-).

I tend to like manually numbered when I'm writing and I need to refer back to a particular item. On the other hand, moving things around is much easier if they aren't numbered - and in if you do reference things, you have to change the number anyway.

I have a preference for autonumbered, but perhaps we should create a separate discussion.


## See also

* [Dynamic client-side scripting with JavaScript](/en-US/docs/Learn/JavaScript)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a definition list? Seems to be too much spacing following bullets

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is and you are right, the spacing is peculiar and throwing this markdown in prettier's playground has a different, properyl spaced output. Strange. I shall investigate!

@Gregoor
Copy link
Contributor Author

Gregoor commented Jun 10, 2021

Thank you for the comments Hamish!

@Rumyra
Copy link
Collaborator

Rumyra commented Jun 10, 2021

  • I like the md syntax - seems common syntax
  • What's with newlines? Is this for github? I'm pro that for reviewing
  • Maybe some guidelines on editor preferences to keep tabs/spaces consistent - I don't know if that's helped by the linter btw
    • Related: There's a few <0xa0> dotted around code examples
    • Related: Reading Hamish's comments, it would be good to have * thing instead of * thing (github has removed the spaces here, but basically one space after a list identifier)
  • How are we dealing with both md & html files being present - I assume we'll remove the html file
  • I like the dl syntax - clever

@wbamberg
Copy link
Collaborator

wbamberg commented Jun 10, 2021

Thanks for the comments!

  • What's with newlines? Is this for github? I'm pro that for reviewing

I don't understand this, could you elaborate?

  • Maybe some guidelines on editor preferences to keep tabs/spaces consistent - I don't know if that's helped by the linter btw

I'm thinking we will (and are in fact) running Prettier over all this, so we don't care what people do. But I guess it would be advisable for people to use Prettier locally as well?

Which makes me think - if people are running Prettier locally will we have a problem with the table-formatting hack we are adding to work around prettier/prettier#10950 ?

  • Related: There's a few <0xa0> dotted around code examples

Could you point to an example?

  • Related: Reading Hamish's comments, it would be good to have * thing instead of * thing (github has removed the spaces here, but basically one space after a list identifier)

  • How are we dealing with both md & html files being present - I assume we'll remove the html file

Do you mean, HTML and MD versions of the same page? I think that won't happen: when we convert we'll remove the HTML version.

---
{{jsSidebar("Operators")}}

The **`function*`** keyword can be used to define a generator function inside an
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, I wonder if it is worth doing a search for *, **, _ used outside of pre or code blocks in our html. Of course you probably already addressed that.

@hamishwillee
Copy link
Collaborator

@Gregoor FWI did a much bigger check of the source and didn't find anything else. For me the unnecessary spaces after list times are the main "dealbreaker". Upshot - impressive conversion.

@wbamberg
Copy link
Collaborator

@Gregoor FWI did a much bigger check of the source and didn't find anything else. For me the unnecessary spaces after list times are the main "dealbreaker". Upshot - impressive conversion.

We very much appreciate you taking the time to look, Hamish!

@Gregoor
Copy link
Contributor Author

Gregoor commented Jun 14, 2021

Thank you for the feedback everyone! I've updated the converter and this PR, /learn is now not part of this (as intended) and the list-item spacing issue should be resolved. There is some more words on what the issue was in this commit message. I read through a number of files to check whether it caused new regression but fortunately came up empty so far. But if you see something let me know.

@Gregoor
Copy link
Contributor Author

Gregoor commented Jun 14, 2021

Oh and we merged the converter into yari-main, so now all you need to do to preview it locally is check-out this branch and run yarn start!

@wbamberg
Copy link
Collaborator

wbamberg commented Jun 18, 2021

I've been digging through the report today. I was hoping to go through it all but ran out of time. But I noticed that sometimes, when the converter decides it can't convert something and wants to leave it in HTML, the output gets mangled. For example:

I will finish going through the whole report tomorrow but thought it was worth mentioning these now.

I also think we could choose different strategies for many of these. For example, we should probably convert dfnto em, and we should be able to discard <span class="seoSummary"> whenever it contains the entire first paragraph, which is always for JS. But let's talk about that separate from the conversion errors, and once I've had a look through the report.

@wbamberg
Copy link
Collaborator

wbamberg commented Jun 19, 2021

I had a good look at the conversion report: https://github.com/mdn/content/blob/h2m-js/md-conversion-problems-report.md.

First, I think the report logs two sorts of problems:

  • Missing conversion rules: this is where the converter doesn't know how to convert an element, because we didn't tell it what to do. Sometimes this is fine, because that's what we wanted - we don't have a GFM representation, so we want HTML. Sometimes we might want to treat this message as signalling an error, because it signals some content that we would like to clean out of the HTML before conversion (that is, we might want to define a rule for it). (Side-note: I think the converter logs this as an H3 ### Missing conversion rules when it should log it as H4, so it lives under the heading for the page affected.)

  • Invalid AST transformations: this is where the converter tried to convert an element but failed. Almost always this is because of unconvertible tables. We know we can convert some but not all tables, so we always try, and if the table turns out to be unconvertible we log this error. We might also see it whenever our Markdown is more limited than the HTML - for example our dl, which doesn't allow for multiple consecutive dt or multiple consecutive dd elements. But also it might be a signal that some element we expect to be converted failed conversion.

In both these cases the converter handles it by not converting the HTML, and keeping it as-is. In general if there aren't too many of these cases, we don't mind a little "unplanned" HTML creeping in. But it's worth looking through the report in case there are unexpected problems or cases where we can tighten things up.

I've been through all the types in the summary at the top of the report. I've referred a few times in the analysis below to the conversion spreadsheet which is at https://docs.google.com/spreadsheets/d/1Nb-WUHveeUfi5YV0-pzVyHI1vR1IC8xF40IdkiceyQQ/edit#gid=1365998303, and which refers to #3350 (comment) sometimes where the conversion is more nuanced. If possible I'd like that spreadsheet to be a reasonably definitive guide for how we want to handle conversion to Markdown.

Tables

  • tr (177)
  • td (162)
  • th[scope="row"] (43)
  • table.standard-table (39)
  • th (17)
  • table (6)
  • table.fullwidth-table (4)
  • th[colSpan="4"] (2)
  • td[rowSpan="4"] (1)

There are a lot of these and it's hard to check them all in detail, but all the ones I checked are "Invalid AST transformations" at the top level (table), and "Missing conversion rules" for subcomponents like th, that GFM doesn't understand. Which suggests that this is caused by tables that are not convertible to GFM: we try to convert them, fail, and log an "Invalid AST" error.

This seems fine: we expect that some tables are not convertible to GFM and will need to stay as HTML (https://developer.mozilla.org/en-US/docs/MDN/Contribute/Markdown_in_MDN#tables). It looks like 49 tables were unconvertible, which is almost exactly half the tables in the JS docs. Given how limited GFM table syntax is, this isn't surprising.

dfn

These are "Missing conversion rules".

The conversion sheet gives a conversion of "GFM em?" for the dfn element, although the "?" implies we are not sure about it. I think we should basically certify that decision, and agree to convert <dfn>thing</dfn> to:

 `thing` 

Note that the conversion of one of these dfn elements produces bad Markdown output, as noted in #5193 (comment).

span.seoSummary

These are "Missing conversion rules".

The conversion sheet points to "summary/seoSummary" in #3350 (comment), which basically says we should strip out this element when its text content matches the first paragraph of the doc, and log an error otherwise. See also #3923.

But if this is too complicated we could leave these as unconverted, and deal with this in content.

Note that the conversion of some of these span.seoSummary elements produces bad Markdown output, as noted in #5193 (comment).

kbd

These are "Missing conversion rules".

We haven't written down what to do about this yet. I think we should keep them as HTML.

  • we don't have a good Markdown alternative
  • they don't appear often and in quite precise circumstances
  • as inline elements they don't "leak out" into the surrounding page, unlike things like <table> or <dl>

So this should stay as "Missing conversion rules".

span.blob-code-inner.blob-code-marker

These are "Missing conversion rules".

This should be stripped out (captured by "anything else" in the "classes" tab of the conversion spreadsheet).

sub

These are "Missing conversion rules".

Fine for this to stay HTML, per https://developer.mozilla.org/en-US/docs/MDN/Contribute/Markdown_in_MDN#superscript_and_subscript.

So this should stay as "Missing conversion rules".

figure

This got added https://github.com/mdn/content/pull/4762/files, which explains why we didn't define a rule for it :(. I think we should remove this and use non-semantic markup in cases like this. But I'll do this manually in a content PR.

Update: these got removed in #6128.

p.summary

Same as for span.seoSummary.

code

These are "Invalid AST transformations".

This looks like a bug? In both these cases we're getting mangled output. I reported them at #5193 (comment).

pre.brush:.js.highlight:.[5]

These are "Missing conversion rules".

According to #3512 and the conversion spreadsheet under the catch-all "(anything else)" in the "classes" tab, we should strip this attribute.

dl

These are "Invalid AST transformations".

These are two cases where the converter didn't think it could convert the markup, so kept it as HTML. The HTML output looks OK and I don't mind a couple of dl elements not getting converted, but I wonder why this is happening, they look OK to me.

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Equality_comparisons_and_sameness
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Generator/throw

abbr[title="ECMAScript 5th edition"]

These are "Missing conversion rules".

Should be stripped out according to the conversion spreadsheet.

@Gregoor
Copy link
Contributor Author

Gregoor commented Jun 21, 2021

Thank you so much Will, this is great! I'll address the points below

Invalid Markdown/HTML

It looks like we have two prettier-related problems. One was easy to fix, namely I was keeping the newline it was adding to prettified HTML, which busted some markdown nodes. Fixing that one fixed these pages:

  • Guide/Indexed_collections
  • Reference/Global_Objects/Object/defineProperty
  • Reference/Global_Objects/Function/arguments

But then there is a more gnarly one where prettier once again has interesting ideas wrt how to format nodes, see here. I am now wondering if we want to move the prettier pendulum a bit back again and only format HTML tables as these seem to not cause problems and with these weird cases I'm really at a bit of a loss right now. I could try another quick-fix a la what we did for the multi-line closing tag (https://github.com/mdn/yari/blob/b0dbaed4bc4135b51217400f750179b4a3bebc28/markdown/h2m/utils.ts#L38-L40) but I think that would be trickier here as it would also need to account for elements with attributes. So long story short, my recommendation is we'd limit prettifiying HTML to <table>s, with all the other HTML staying as potentially-less-wieldy one-liners. Wdyt?

dfn

<dfn> will now become inlineCode nodes aka ``

span.seoSummary

It looks like the toText implementation we are using differs from what our summary-extraction is doing, so there are a lot of conversions not going through. I am looking at that now and hope to fix most of those I saw (Update: I've started working on it over in their repo) . There will still be some "missing conversion rules" for summaries remaining, where the extracted summaries do not match. The error "missing rule" is a bit of a misnomer, hope that is not too confusing!

.blob-code-*

Will now be stripped.

code

After some thinking I think this is what we decided on (not the broken markup, that should be fixed with the prettier workaround mentioned above). Let's look at the first one which is:

<code><strong><em>function</em>.arguments</strong></code>

This could be turned into markdown code like this:

**_`function`_`.arguments`**

Which would require changing the nesting. That is something we do for simple cases like <code><strong>function</strong></code> but the given examples are a bit more tricky (multiple children and nesting). I'd rather do these manually after converting than add rules. Though if we have a lot more of these cases in the other content it might be worth reconsidering.

pre.brush:.js.highlight:.[5]

Added stripping for these cases!

abbr

Added a rule for stripping those tags away (and ignoring their title).


Thank you again Will, I think this is getting more and more refined. I will create and deploy a new build once I've fixed the summary-issue and merged the PR with all the fixes into yari.

@Gregoor Gregoor mentioned this pull request Jun 21, 2021
@wbamberg
Copy link
Collaborator

Thank you, Gregor!

Invalid Markdown/HTML

It looks like we have two prettier-related problems. One was easy to fix, namely I was keeping the newline it was adding to prettified HTML, which busted some markdown nodes. Fixing that one fixed these pages:

* Guide/Indexed_collections

* Reference/Global_Objects/Object/defineProperty

* Reference/Global_Objects/Function/arguments

But then there is a more gnarly one where prettier once again has interesting ideas wrt how to format nodes, see here. I am now wondering if we want to move the prettier pendulum a bit back again and only format HTML tables as these seem to not cause problems and with these weird cases I'm really at a bit of a loss right now. I could try another quick-fix a la what we did for the multi-line closing tag (https://github.com/mdn/yari/blob/b0dbaed4bc4135b51217400f750179b4a3bebc28/markdown/h2m/utils.ts#L38-L40) but I think that would be trickier here as it would also need to account for elements with attributes. So long story short, my recommendation is we'd limit prettifiying HTML to <table>s, with all the other HTML staying as potentially-less-wieldy one-liners. Wdyt?

So to be clear this would still prettify Markdown, but only prettify HTML in tables?

I think this is a good idea. I don't expect there will be a lot of HTML apart from tables, and most of it will be inline (sub and sup and so on) so definitely the main benefit of prettifying HTML is for the tables. And it sounds like this will be a lot safer than trying to fix this problem in every place it appears.

span.seoSummary

It looks like the toText implementation we are using differs from what our summary-extraction is doing, so there are a lot of conversions not going through. I am looking at that now and hope to fix most of those I saw (Update: I've started working on it over in their repo) . There will still be some "missing conversion rules" for summaries remaining, where the extracted summaries do not match. The error "missing rule" is a bit of a misnomer, hope that is not too confusing!

That's fine. As long as we can look in the report to find summaries that didn't get converted because of a no-match, then we can fix those in content.

I did notice that the report listed one case where the summary would not match, in new.target - I thought about fixing it in content but then thought it might be better to leave it in for a while so it could be a test.

code

After some thinking I think this is what we decided on (not the broken markup, that should be fixed with the prettier workaround mentioned above). Let's look at the first one which is:

<code><strong><em>function</em>.arguments</strong></code>

This could be turned into markdown code like this:

**_`function`_`.arguments`**

Which would require changing the nesting. That is something we do for simple cases like <code><strong>function</strong></code> but the given examples are a bit more tricky (multiple children and nesting). I'd rather do these manually after converting than add rules. Though if we have a lot more of these cases in the other content it might be worth reconsidering.

You are right of course! We should keep these as HTML. We can fix them in content if necessary. But anyway 2 cases out of 1000 pages is not worth worrying about IMO.

I should update the docs for this case.

@wbamberg
Copy link
Collaborator

Closed in favour of #7092.

@wbamberg wbamberg closed this Jul 21, 2021
@Josh-Cena Josh-Cena deleted the h2m-js branch July 7, 2022 14:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants