fix: Total rework of Emphasis/Strong #1864

Merged
merged 15 commits into from
Feb 7, 2021

Conversation

calculuschild
Contributor

@calculuschild calculuschild commented Dec 7, 2020

Description

  • Fixes #1860 (em and strong, `***〜***`) and #1811 (Asterisks are not properly escaped)

  • Also fixes Commonmark/GFM examples:

    • (em & strong) 361, 387, 388, 407, 412, 415, 416, 424, 425, 442, 445, 446, 453, 455, 456, 457, 465, 466, 467, 470
    • This puts us at 100% compatibility with the CommonMark spec for the em & strong examples!
  • Noticeable speedup, especially on the GFM benchmark (~8.7 sec -> ~8.1 sec, pretty consistent over 5 runs on my laptop)

What was attempted

  1. Simplify regex for em & strong, combined now into a single tokenizer
  2. When masking the src string in Lexer, also mask out escaped \* and \_ which further simplifies a lot of regex
  3. Track total opening delimiter characters vs closing characters, and ensure they match
  4. More closely follow CommonMark spec:
    • Favor <em><strong>text</strong></em> over <strong><em>text</em></strong>
    • Correct some of the "New" spec tests that had this ^ swapped incorrectly (and delete one that is redundant now)
    • Handle em/strong CommonMark rules 9-10: when one of the delimiter runs can both open and close emphasis, the lengths of the left and right runs cannot sum to a multiple of 3, unless each length is itself a multiple of 3
    • Handle cases with lots of extra unmatched delimiters, e.g. *text*********

Note this involves significant changes in the Lexer and Tokenizer APIs, which should be noted in the update.

The new Regex should be pretty benign compared to the earlier stuff. It literally checks for sequences of the pattern a***b, that is, runs of * or _ between a single character on each side.
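As a rough illustration of the `a***b` idea (not the regex from this PR), a scanner can collect each run of `*` or `_` together with the single character on either side. `findDelimiterRuns` is a hypothetical helper name.

```javascript
// Collect runs of * or _ along with the characters immediately before and
// after each run. An empty string marks the start or end of the input.
function findDelimiterRuns(src) {
  const runs = [];
  const re = /\*+|_+/g;
  let m;
  while ((m = re.exec(src)) !== null) {
    runs.push({
      index: m.index,
      run: m[0],
      before: m.index > 0 ? src[m.index - 1] : '',
      after: src[m.index + m[0].length] || ''
    });
  }
  return runs;
}
```

The neighbouring characters are what a tokenizer would inspect to decide whether a run is left-flanking, right-flanking, or both.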

Contributor

  • Test(s) exist to ensure functionality and minimize regression (if no tests added, list tests covering this PR); or,
  • no tests required for this PR.
  • If submitting new feature, it has been documented in the appropriate places.

Committer

In most cases, this should be a different person than the contributor.

vercel bot commented Dec 7, 2020

This pull request is being automatically deployed with Vercel (learn more).
To see the status of your deployment, click below or on the icon next to each commit.

🔍 Inspect: https://vercel.com/markedjs/markedjs/7lc52k1xy
✅ Preview: https://markedjs-git-fork-calculuschild-emstrongrework.markedjs.vercel.app

These tests look like they existed solely to cover the CommonMark examples with Strong and Em together that Marked wasn't passing because it output them backwards: `<strong><em>` instead of `<em><strong>`. They are no longer necessary.
// em
if (token = this.tokenizer.em(src, maskedSrc, prevChar)) {
// em & strong
if (token = this.tokenizer.emStrong(src, maskedSrc, prevChar)) {
Member

This would definitely be a breaking change since the tokenizers are part of the public API. Can we do this without combining them? Can we just switch the order of em and strong to get <strong><em>a</em></strong> to switch to <em><strong>a</strong></em>?

Contributor Author

@calculuschild calculuschild Dec 7, 2020

Unfortunately, they kind of need to be tackled together to get the right sequence of <em><strong> to work, which isn't just a stylistic thing. Even though it renders the same as <strong><em>, the processing to get to that point also clears up several other bugs, especially regarding uneven **text***** delimiters on both sides.

Edit for clarification: processing em/strong in this way allows following more of the CommonMark specs in a "natural" way that I think will be much easier to maintain (instead of a monstrous, fiddly regex). However, this also means you don't really know if the output is going to be an em or a strong until the very end of the process (see the very end of the Tokenizer).
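One plausible way to picture the "decide at the very end" step, assuming the opening and closing run lengths are already known: the parity of the smaller run picks the outermost tag. This is a simplified sketch with a hypothetical function name, not the tokenizer code from this PR.

```javascript
// Given matched opening and closing run lengths, list the resulting tags
// from outermost to innermost. An odd number of usable delimiters peels off
// one character as <em>; otherwise two characters are peeled off as <strong>.
// Delimiters beyond the shorter run are left over as literal text.
function nestingFor(openLen, closeLen) {
  const tags = [];
  let remaining = Math.min(openLen, closeLen);
  while (remaining > 0) {
    const used = remaining % 2 === 1 ? 1 : 2;
    tags.push(used === 1 ? 'em' : 'strong');
    remaining -= used;
  }
  return tags;
}
```

For `***text***` this gives `em` outside `strong`, and for `**text*****` it gives a single `strong` with three closing asterisks left over.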

This might be worth putting into a v2.

Member

After researching quite a few dependants I think it should be fine to combine them since most dependants will change the renderer instead of the tokenizer. This will have to be a major bump to v2 though. I do want to get a few other breaking changes together before releasing v2 so it might be a while before I get to fully reviewing this PR.

Contributor Author

That sounds appropriate. I need to review the other PRs you have waiting as well that should go out before this anyway...

There are some other changes I've seen in the issues list that I'd like to lump into a v2 bump as well.

Co-authored-by: Steven <[email protected]>
@UziTech UziTech requested a review from styfle February 5, 2021 04:20
Member

UziTech commented Feb 5, 2021

@styfle We should get this and #1926 released as v2 soon. This PR fixes the security issue in #1927

@UziTech UziTech changed the title Total rework of Emphasis/Strong fix: Total rework of Emphasis/Strong Feb 7, 2021
@UziTech UziTech merged commit 7293251 into markedjs:master Feb 7, 2021
github-actions bot pushed a commit that referenced this pull request Feb 7, 2021
# [2.0.0](v1.2.9...v2.0.0) (2021-02-07)

### Bug Fixes

* Join adjacent inlineText tokens ([#1926](#1926)) ([f848e77](f848e77))
* Total rework of Emphasis/Strong ([#1864](#1864)) ([7293251](7293251))

### BREAKING CHANGES

* `em` and `strong` tokenizers have been merged into one `emStrong` tokenizer
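
For code that previously overrode the separate `em` and `strong` tokenizers, a migration might look roughly like the sketch below. The override body is hypothetical; the `(src, maskedSrc, prevChar)` argument list follows the Lexer call shown earlier in this PR.

```javascript
// Hypothetical custom tokenizer for marked v2. Before v2, an override would
// have defined separate `em` and `strong` methods instead.
const customTokenizer = {
  emStrong(src, maskedSrc, prevChar) {
    // Custom handling would go here. Returning false defers to marked's
    // built-in emStrong tokenizer.
    return false;
  }
};

// Registration, assuming marked is installed:
// const marked = require('marked');
// marked.use({ tokenizer: customTokenizer });
```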

github-actions bot commented Feb 7, 2021

🎉 This PR is included in version 2.0.0 🎉

The release is available on:

Your semantic-release bot 📦🚀
