[Markdown] Migrate one section of the MDN docs to Markdown #3350

Closed
wbamberg opened this issue Mar 22, 2021 · 6 comments
Labels
MDN:Project Anything related to larger core projects on MDN

Comments

@wbamberg
Collaborator

wbamberg commented Mar 22, 2021

This is the “project summary issue” (https://developer.mozilla.org/en-US/docs/MDN/Contribute/Processes/Workstream_assessment_project#overall_project_summary_issue) for the project to migrate the MDN JavaScript docs into Markdown.

RFC for the work: openwebdocs/project#25

People

@wbamberg, responsible for:

  • overall project definition and running
  • making sure decisions get made around authoring formats
  • making sure any necessary content updates get made

@fiji-flo, responsible for:

  • implementing tools to convert MDN content into Markdown to the agreed specification
  • enhancing Yari so it can render MDN pages from the Markdown.

@ddbeck, @chrisdavidmills, @Elchi3: consulted especially around questions of authoring formats and Markdown extensions.
@escattone, @peterbe: consulted especially around Yari architecture questions

Success criteria:

  • We have the MDN JavaScript (everything under https://developer.mozilla.org/en-US/docs/Web/JavaScript) docs in Markdown
  • We have a specification for the authoring format we use on MDN, including the base Markdown support and any extensions we have selected.
  • Yari is able to render the markdown as fully-featured MDN pages.

Steps/tasks:

  1. Write an initial Markdown specification. This should start with a “baseline” Markdown choice (GFM) and add details for any extra things we want to do. We should file issues for any of these that aren’t clear, where we can research and come to a decision. The output of these issues is an updated specification. The list of these extra issues would include:
    • Analyse where in the existing content we are using CSS classes and IDs, and make a recommendation for what to do about these in Markdown.
    • Analyse where in the existing content we are using HTML elements not supported directly in GFM, and make a recommendation for what to do about these in Markdown.
    • Decide how to handle live samples.
    • Decide how to represent notes and warnings.
    • Decide what to do about definition lists.
    • Decide how to represent cross-references.
    • ...?
  2. Make any changes to the content for places the content isn’t compatible with the spec (e.g. live samples that refer to non-heading IDs)
  3. Implement the spec for roundtrip HTML->Markdown->HTML conversion. Consider adding linting, so the conversion can report on any outstanding content issues (for example, remaining non-heading IDs).
  4. Look at the results, and iterate on the spec as needed
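
The linting idea in step 3 can be made concrete with a small example. The following is a minimal sketch, not the converter's actual implementation: it uses Python's standard html.parser to report id attributes on non-heading elements. The names NonHeadingIdLinter and lint_non_heading_ids are hypothetical.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class NonHeadingIdLinter(HTMLParser):
    """Collects (tag, id) pairs for id attributes on non-heading elements."""

    def __init__(self):
        super().__init__()
        self.problems = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and tag not in HEADINGS:
                self.problems.append((tag, value))


def lint_non_heading_ids(html):
    """Return every (tag, id) pair that still needs a manual content fix."""
    linter = NonHeadingIdLinter()
    linter.feed(html)
    linter.close()
    return linter.problems


if __name__ == "__main__":
    page = '<h2 id="syntax">Syntax</h2><div id="demo"><pre>x</pre></div>'
    for tag, element_id in lint_non_heading_ids(page):
        print(f"non-heading id: <{tag} id={element_id!r}>")
```

An empty result would mean the page has no remaining non-heading IDs.
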
wbamberg added the "needs triage" label (Triage needed by staff and/or partners. Automatically applied when an issue is opened.) on Mar 22, 2021
wbamberg changed the title from "[project summary] Migrate one section of the MDN docs to Markdown" to "[Markdown] Migrate one section of the MDN docs to Markdown" on Mar 24, 2021
@wbamberg
Collaborator Author

wbamberg commented Apr 2, 2021

I've updated this issue. Before it said "migrate a substantial portion of the docs to Markdown" but left it open which portion. I've updated this to pick the JavaScript docs. We've been talking about this choice for a while, so I don't expect it to be contentious.

Reasons for this choice:

  • the JS docs are a big enough piece of MDN (~1000 pages, or about 1/10 of the en-US/web total) to feel like a significant step, and to force us to resolve most of the issues we'll face in migrating all the docs
  • but the docs are in pretty good shape and quite consistent, so they'll probably be easier to migrate than some other areas of the site

@wbamberg
Collaborator Author

wbamberg commented Apr 29, 2021

Specifying MDN to Markdown conversion

This document describes how we'll convert MDN's HTML content into Markdown. It focuses on the JavaScript docs (https://developer.mozilla.org/en-US/docs/Web/JavaScript) because converting them is our immediate goal. However, it should also be useful for converting other doc sets in the future.

It tries to take a systematic approach to conversion by listing:

  • every HTML element
  • every HTML attribute
  • every value for the class attribute encountered

and deciding what the conversion process needs to do when it sees that item.
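
One way to build such an inventory is to count what actually occurs in the source. This is a minimal sketch using Python's standard html.parser, not the tooling used for the spreadsheet. The names Inventory and take_inventory are hypothetical, and the input is assumed to be a list of HTML strings, one per page.

```python
from collections import Counter
from html.parser import HTMLParser


class Inventory(HTMLParser):
    """Counts element names, attribute names, and class values."""

    def __init__(self):
        super().__init__()
        self.elements = Counter()
        self.attributes = Counter()
        self.classes = Counter()

    def handle_starttag(self, tag, attrs):
        self.elements[tag] += 1
        for name, value in attrs:
            self.attributes[name] += 1
            if name == "class" and value:
                # A class attribute can carry several space-separated values.
                self.classes.update(value.split())


def take_inventory(pages):
    """Feed every page through one Inventory and return it."""
    inventory = Inventory()
    for html in pages:
        inventory.feed(html)
        inventory.close()
        inventory.reset()  # clears parser state only; the counters are kept
    return inventory


if __name__ == "__main__":
    result = take_inventory(['<p class="note">a</p>', '<img src="x.png" alt="">'])
    print(result.elements.most_common())
    print(result.attributes.most_common())
    print(result.classes.most_common())
```

Each key in the three counters would correspond to one row on the matching page of the sheet.
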

The full details are in this Google sheet: https://docs.google.com/spreadsheets/d/1Nb-WUHveeUfi5YV0-pzVyHI1vR1IC8xF40IdkiceyQQ/edit#gid=0 . This document describes the spreadsheet, summarises the results, and provides more details on the choices it lists.

The spreadsheet has one page for elements, one for attributes, one for values of the class attribute. Each page has four columns:

  • Name: the name of the element/attribute/class in question
  • Conversion: what the conversion process should do when it encounters this element/attribute/class
  • In JS docs?: whether this element/attribute/class even occurs in the JS docs. Note that this column is redundant for class values: because a class can be anything, I've only listed values that actually occur in the JS docs.
  • Issue: link to the GitHub issue where we are discussing what to do about this item

Generic conversion categories

Cells in the "Conversion" column list one of a few different generic categories, which we'll describe here.

  • GFM: This is the easy category, where an item has a direct representation in GFM. This applies to things like <p>, <li>, <img src=...>, <pre class="js"> and so on. In the sheet I've highlighted these in a soothing avocado colour.

  • Error: This means: if we see this item, the content is not yet ready for conversion. It needs people to fix the content so this item no longer appears in the source. So the conversion process needs to log an error and we need to address it.

    We should choose this category when we don't want to support this item in our Markdown source, but we can't just remove it automatically, because this will probably break the content. The style attribute is a good example.

  • Strip tags/strip attribute: This means: throw away the tag/attribute, but keep the contents of the tag.

    For example, a <span> element with no attributes isn't adding anything that we want to capture in the converted markup. Sometimes these choices remove semantics from the markup: for example the sheet recommends that we discard <abbr> tags. So to make this choice means we accept that loss.

    Often it's hard to choose between this category and "Error": it's a matter of judgement whether to require a manual decision about a tag rather than silently removing it.

  • Keep original: This means: don't convert the source, just transfer the tag and its contents as-is.

    We should choose this when we do want to keep the original feature, but don't have a sensible way to represent it in Markdown. MathML and SVG are good examples here.

  • GFM XYZ: This means: treat this as a different but related element XYZ, that has a GFM representation, and emit the GFM representation of that related element.

    Compared with just "GFM", this is a bit dodgy, because we're generally throwing away some semantics. But we have no way to represent these semantics anyway in our target format, so we don't have a better option.

    Something like <dfn> is an example of this: we might choose to use the GFM syntax for <em>, because that matches how browsers typically render <dfn>. But we lose the semantics.

For attributes especially, we sometimes combine these, because the resolution is different depending on what the element is. For example src can be converted to GFM when it's attached to an <img>, but is an error otherwise.
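
The categories above can be pictured as a lookup table that the converter consults for each item. This is a minimal sketch that encodes only the examples named in this section; the real decisions live in the spreadsheet. The names Category, ELEMENT_RULES, classify_element, and classify_attribute are hypothetical, and the <dfn> entry reflects a choice the text describes as tentative.

```python
from enum import Enum


class Category(Enum):
    GFM = "GFM"
    ERROR = "Error"
    STRIP = "Strip tags/strip attribute"
    KEEP = "Keep original"
    GFM_AS = "GFM XYZ"  # emit the GFM form of a related element


# Maps an element name to (category, related element or None).
ELEMENT_RULES = {
    "p": (Category.GFM, None),
    "li": (Category.GFM, None),
    "span": (Category.STRIP, None),   # the text's example is a <span> with no attributes
    "abbr": (Category.STRIP, None),
    "math": (Category.KEEP, None),    # MathML
    "svg": (Category.KEEP, None),
    "dfn": (Category.GFM_AS, "em"),   # tentative: "we might choose" this
}


def classify_element(name):
    """Return (category, related element), or None if the item is unresolved."""
    return ELEMENT_RULES.get(name)


def classify_attribute(name, element):
    """Attribute rules can depend on the element the attribute is attached to."""
    if name == "style":
        return Category.ERROR
    if name == "src":
        return Category.GFM if element == "img" else Category.ERROR
    return None  # unresolved in this sketch


if __name__ == "__main__":
    print(classify_element("dfn"))
    print(classify_attribute("src", "img"))
    print(classify_attribute("src", "iframe"))
```

Returning None for anything not listed mirrors the sheet's items that have no category selected yet.
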

Custom conversion categories

Note/warning/callout

If an element has class="note":

If an element has class="warning":

If an element has class="callout":

Tables

The short story about tables is: if we can represent it using the GFM table syntax, we will, otherwise we will use HTML (https://developer.mozilla.org/en-US/docs/MDN/Contribute/Markdown_in_MDN#tables).

For the converter, this means: if the table contains any features that would prevent it being represented in GFM, then leave it unchanged, otherwise convert it to GFM.

Features that would prevent it being represented in GFM include:

  • tables with no headers
  • tables with a header column (that is, row headers)
  • table cells that contain block elements
  • tables that use any elements beyond <table>, <tr>, <th>, and <td>
  • tables whose elements use any table attributes, like colspan, rowspan, or scope.

One exception is <caption>: if a table would otherwise be convertible to GFM, except that it has a caption, then the converter can remove the caption element.
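
The check the converter needs can be sketched roughly as follows. This is an illustration under stated assumptions, not the converter's code: TableChecker and gfm_blockers are hypothetical names, TABLE_ATTRIBUTES and BLOCK_ELEMENTS are illustrative subsets, inline elements inside cells are assumed to be acceptable, and the sketch only records whether any <th> exists without examining where header cells sit.

```python
from html.parser import HTMLParser

ALLOWED_ELEMENTS = {"table", "tr", "th", "td", "caption"}
TABLE_ATTRIBUTES = {"colspan", "rowspan", "scope", "headers"}  # illustrative subset
BLOCK_ELEMENTS = {"p", "div", "ul", "ol", "dl", "pre", "blockquote", "table",
                  "h1", "h2", "h3", "h4", "h5", "h6"}  # illustrative subset


class TableChecker(HTMLParser):
    """Records features that prevent a table being written as GFM."""

    def __init__(self):
        super().__init__()
        self.reasons = []
        self.has_header = False
        self.in_cell = False
        self.in_caption = False

    def handle_starttag(self, tag, attrs):
        if tag == "caption":
            # A caption is removable, so nothing inside it counts as a blocker.
            self.in_caption = True
            return
        if self.in_caption:
            return
        if self.in_cell and tag not in ("td", "th", "tr"):
            if tag in BLOCK_ELEMENTS:
                self.reasons.append(f"block element <{tag}> in a cell")
            return
        if tag in ("td", "th"):
            self.in_cell = True
            self.has_header = self.has_header or tag == "th"
        elif tag == "tr":
            self.in_cell = False
        elif tag not in ALLOWED_ELEMENTS:
            self.reasons.append(f"unsupported element <{tag}>")
        for name, _value in attrs:
            if name in TABLE_ATTRIBUTES:
                self.reasons.append(f"attribute {name} on <{tag}>")

    def handle_endtag(self, tag):
        if tag == "caption":
            self.in_caption = False
        elif tag in ("td", "th"):
            self.in_cell = False


def gfm_blockers(table_html):
    """Return the reasons a table must stay as HTML; an empty list means convert."""
    checker = TableChecker()
    checker.feed(table_html)
    checker.close()
    reasons = list(checker.reasons)
    if not checker.has_header:
        reasons.append("no header cells")
    return reasons
```

Returning the list of reasons rather than a boolean lets the converter log why a given table was left as HTML.
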

Hidden class

If an element has class="hidden":

  • if it's a div with an ID, and contains code blocks
    • if it contains only code blocks, transfer the "hidden" class to the code blocks, and discard the outer div element (but keep the code blocks)
    • otherwise, raise an error: this is an unconvertible structure and needs the humans to sort it out manually
  • otherwise, discard the element and its contents

The rationale for this is that we expect class="hidden" to be used in two contexts:

  • as hidden prose, where it represented instructions to editors in the old Wiki
  • as a way to hide parts of a live sample but still have them accessible to the live sample system

For the first case we just want to remove the hidden prose.

For the second case, we can still hide code blocks in the live sample system via the hidden attribute on the block's info string. However, if people are hiding other elements that might participate in live samples, like headings, we probably need to update the content to make it reliably convertible.
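
The decision rules for class="hidden" can be written down as a small function. This is a minimal sketch over a simplified element model, not the converter's code: Element and resolve_hidden are hypothetical names, a code block is assumed to be a <pre> element, "contains" is assumed to mean direct children, and text nodes are not modelled.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Element:
    tag: str
    element_id: Optional[str] = None
    classes: List[str] = field(default_factory=list)
    children: List["Element"] = field(default_factory=list)


def resolve_hidden(element):
    """Decide what to do with an element that has class="hidden".

    Returns (action, elements), where action is "unwrap", "error", or "discard".
    """
    code_blocks = [child for child in element.children if child.tag == "pre"]
    if element.tag == "div" and element.element_id and code_blocks:
        if len(code_blocks) == len(element.children):
            # Only code blocks: move "hidden" onto them and drop the wrapper.
            for block in code_blocks:
                if "hidden" not in block.classes:
                    block.classes.append("hidden")
            return ("unwrap", code_blocks)
        # Mixed content: needs a human to restructure it.
        return ("error", [element])
    # Hidden prose, such as old editor instructions: drop it entirely.
    return ("discard", [])


if __name__ == "__main__":
    sample = Element("div", "demo", ["hidden"], [Element("pre", classes=["css"])])
    print(resolve_hidden(sample))
```
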

summary/seoSummary

For the summary and seoSummary classes:

  • if the text content of the element that has the summary or seoSummary class matches the text content of the first prose paragraph of the document, then remove the classes.
  • otherwise raise an error

One complication is "first prose paragraph": many documents start with an element containing only macro calls, like:

<div>{{CSSRef}}</div>

Usually this is a <div>, but sometimes people have used <p>. So the converter should use some heuristic like "first paragraph element that does not start with {{".
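
That heuristic could look something like the following. It is a minimal sketch using Python's standard html.parser, not the converter's code: ParagraphCollector, first_prose_paragraph, and summary_action are hypothetical names, and comparing text after collapsing whitespace is an assumption about what "matches" should mean.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collects the whitespace-normalised text of each <p>, in document order."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._current = None  # text chunks of the <p> being read, if any

    def _finish(self):
        if self._current is not None:
            self.paragraphs.append(" ".join("".join(self._current).split()))
            self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._finish()  # tolerate an unclosed previous <p>
            self._current = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._finish()

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)


def first_prose_paragraph(html):
    """Return the text of the first <p> that does not start with a macro call."""
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    for text in collector.paragraphs:
        if text and not text.startswith("{{"):
            return text
    return None


def summary_action(summary_text, html):
    """Return "remove classes" if the summary matches the first prose paragraph."""
    normalised = " ".join(summary_text.split())
    if normalised == first_prose_paragraph(html):
        return "remove classes"
    return "error"
```

Because only <p> elements are collected, a leading <div>{{CSSRef}}</div> is skipped automatically, and the "{{" test handles the cases where the macro call was wrapped in a <p> instead.
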

Description lists

We've invented our own syntax for <dl> that's based on the syntax for <ul>: https://developer.mozilla.org/en-US/docs/MDN/Contribute/Markdown_in_MDN#definition_lists (see also #4367).

However, this is more restrictive than the HTML <dl>, in that it expects a <dl> to contain only pairs of <dt>/<dd> elements. So:

Unresolved elements/attributes/class values

Not all items have a category selected. This means we're still not sure what to do about them. All these items should have a link to a GitHub issue in which we can work out what to do with the item.

Once the issue has reached a consensus we can assign a category to the corresponding items and close the issue. The resolution might of course be to invent a new category: for example we might have "Extend GFM" for something like <dl>.

Currently the following groups of items are unresolved:

  • classes
    • fullwidth-table, standard-table: not yet tracked by an issue

Once all items are resolved here, we have a complete plan for converting the source into Markdown.

As a practical matter, we only need to have a resolution now for items that appear in the JS docs.

@sideshowbarker
Member

I propose we move this to the Discussions tracker.

@wbamberg
Collaborator Author

wbamberg commented Jun 8, 2021

I would like to keep this here. This is an issue, not a discussion: it exists to track some work we want to get done, not to discuss what to do. I do wish GH was better at representing "project issues" like this but think issues are a better fit than discussions. I agree on all the "decide what to do" ones though.

@sideshowbarker
Member

This is an issue, not a discussion: it exists to track some work we want to get done, not to discuss what to do. I do wish GH was better at representing "project issues" like this but think issues are a better fit than discussions.

OK, yup — makes sense

@wbamberg
Collaborator Author

Closed by #7092 .

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 28, 2022