Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛 Fixed unexpected conversion of single-quoted attributes in HTML cards #19727

Merged

Conversation

kevinansfield
Copy link
Member

@kevinansfield kevinansfield commented Feb 21, 2024

closes ENG-627

We were using cheerio to parse+modify+serialize our rendered HTML to modify links for member attribution. Cheerio's serializer has a long-standing issue (that we've had to deal with before) where it replaces single-quote attributes with double-quote attributes. That was resulting in broken rendering when content used single-quotes such as in HTML cards that have JSON data inside a data- attribute or otherwise used single-quotes to avoid escaping double-quotes in an attribute value.

  • swapped the implementation that uses cheerio for one that uses html5parser to tokenize the html string, from there we can loop over the tokens and replace the href attribute values in the original string without touching any other part of the content. Avoids a full parse+serialize process which is both more costly and can result unexpected content changes due to serializer opinions.
    • fixes the quote change bug
    • uses tokenization directly to avoid cost of building a full AST
  • updated Content API Posts snapshot
    • one of our fixtures has a missing closing tag which we're no longer "fixing" with a full parse+serialize step in the link replacer (keeps modified src closer to original and better matches behaviour elsewhere in the app / without member-attribution applied)
    • the link replacer no longer converts attr="" to attr (these are equivalent in the HTML spec so no change in behaviour other than preserving the original source html)
  • added a benchmark test file comparing the two implementations because the link replacer runs on render so it's used in a hot path
    • new implementation has a 3x performance improvement
    • the separate files with the old/new implementations have been cleaned up but I've left the benchmark test file in place for future reference

Benchmark results comparing implementations:

❯ node test/benchmark.js

LinkReplacer
├─ cheerio: 5.03K /s ±2.20%
├─ html5parser: 16.5K /s ±0.43%

Completed benchmark in 0.9976526670455933s
┌─────────────┬─────────┬────────────┬─────────┬───────┐
│   (index)   │ percent │ iterations │ current │  max  │
├─────────────┼─────────┼────────────┼─────────┼───────┤
│   cheerio   │   ''    │ '5.03K/s'  │  5037   │ 5037  │
│ html5parser │   ''    │ '16.5K/s'  │  16534  │ 16534 │
└─────────────┴─────────┴────────────┴─────────┴───────┘

@kevinansfield kevinansfield force-pushed the fix-single-quote-conversion branch from 0b184b6 to d80b781 Compare February 21, 2024 17:30
closes ENG-627

Cheerio's serializer has a long-standing issue where it replaces single-quote attributes with double-quote attributes. We were using `cheerio` to parse+modify+serialize our rendered HTML to modify links for member attribution. That was resulting in broken rendering when content used single-quotes such as in HTML cards that have JSON data inside a `data-` attribute or otherwise used single-quotes to avoid escaping double-quotes in an attribute value.

- swapped the implementation that uses `cheerio` for one that uses `html5parser` to tokenize the html string, from there we can loop over the tokens and replace the href attribute values in the original string without touching any other part of the content. Avoids a full parse+serialize process which is both more costly and can result unexpected content changes due to serializer opinions.
  - fixes the quote change bug
  - uses tokenization directly to avoid cost of building a full AST
- updated Content API Posts snapshot
  - one of our fixtures has a missing closing tag which we're no longer "fixing" with a full parse+serialize step in the link replacer (keeps modified src closer to original and better matches behaviour elsewhere in the app / without member-attribution applied)
  - the link replacer no longer converts `attr=""` to `attr` (these are equivalent in the HTML spec so no change in behaviour other than preserving the original source html)
- added a benchmark test file comparing the two implementations because the link replacer runs on render so it's used in a hot path
  - new implementation has a 3x performance improvement
  - the separate files with the old/new implementations have been cleaned up but I've left the benchmark test file in place for future reference

Benchmark results comparing implementations:

```
❯ node test/benchmark.js

LinkReplacer
├─ cheerio: 5.03K /s ±2.20%
├─ html5parser: 16.5K /s ±0.43%

Completed benchmark in 0.9976526670455933s
┌─────────────┬─────────┬────────────┬─────────┬───────┐
│   (index)   │ percent │ iterations │ current │  max  │
├─────────────┼─────────┼────────────┼─────────┼───────┤
│   cheerio   │   ''    │ '5.03K/s'  │  5037   │ 5037  │
│ html5parser │   ''    │ '16.5K/s'  │  16534  │ 16534 │
└─────────────┴─────────┴────────────┴─────────┴───────┘
```
@kevinansfield kevinansfield force-pushed the fix-single-quote-conversion branch from d80b781 to d11f3a7 Compare February 21, 2024 18:02
Copy link

codecov bot commented Feb 21, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (8be7f7c) 72.60% compared to head (d11f3a7) 72.60%.
Report is 15 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main   #19727   +/-   ##
=======================================
  Coverage   72.60%   72.60%           
=======================================
  Files        1255     1255           
  Lines       73299    73302    +3     
  Branches     9740     9740           
=======================================
+ Hits        53219    53224    +5     
+ Misses      19326    19324    -2     
  Partials      754      754           
Flag Coverage Δ
admin-tests 39.88% <ø> (+<0.01%) ⬆️
e2e-tests 81.61% <ø> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@SimonBackx SimonBackx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great change!

@kevinansfield kevinansfield merged commit b704530 into TryGhost:main Mar 6, 2024
22 checks passed
@kevinansfield kevinansfield deleted the fix-single-quote-conversion branch March 6, 2024 09:11
allouis added a commit that referenced this pull request Mar 8, 2024
* upstream/main: (32 commits)
  Update dependency i18next to v23.10.1
  Fixed udpated banner z-index (#19822)
  Update dependency jose to v4.15.5
  Update dependency express to v4.18.3
  v5.80.1
  Fixed tiers paywall selecting all paid tiers (#19817)
  🐛 Fixed free tier showing in the tiers-only paywall in posts (#19807)
  Fixed tiers paywall selecting all paid tiers (#19817)
  🐛 Fixed free tier showing in the tiers-only paywall in posts (#19807)
  Fix a French translation error in portal.json (#19803)
  Updated test to check for Unsplash button in Admin (#19814)
  Fixed button spacing in Portal unsubscribe popup footer (#19815)
  Update dependency terser to v5.29.1
  Released Portal v2.37.5 (#19812)
  Update dependency terser to v5.29.0
  🎨 Improved lazy-loading of comments data (#19809)
  Fixed various Unsplash design bugs (#19806)
  🎨 Improved lazy-loading of comments data (#19809)
  🐛 Fixed unexpected conversion of single-quoted attributes in HTML cards (#19727)
  Fixed various Unsplash design bugs (#19806)
  ...
royalfig pushed a commit that referenced this pull request Mar 25, 2024
…ds (#19727)

closes ENG-627

We were using `cheerio` to parse+modify+serialize our rendered HTML to modify links for member attribution. Cheerio's serializer has a [long-standing issue](cheeriojs/cheerio#720) (that we've [had to deal with before](TryGhost/SDK#124)) where it replaces single-quote attributes with double-quote attributes. That was resulting in broken rendering when content used single-quotes such as in HTML cards that have JSON data inside a `data-` attribute or otherwise used single-quotes to avoid escaping double-quotes in an attribute value.

- swapped the implementation that uses `cheerio` for one that uses `html5parser` to tokenize the html string, from there we can loop over the tokens and replace the href attribute values in the original string without touching any other part of the content. Avoids a full parse+serialize process which is both more costly and can result unexpected content changes due to serializer opinions.
  - fixes the quote change bug
  - uses tokenization directly to avoid cost of building a full AST
- updated Content API Posts snapshot
  - one of our fixtures has a missing closing tag which we're no longer "fixing" with a full parse+serialize step in the link replacer (keeps modified src closer to original and better matches behaviour elsewhere in the app / without member-attribution applied)
  - the link replacer no longer converts `attr=""` to `attr` (these are equivalent in the HTML spec so no change in behaviour other than preserving the original source html)
- added a benchmark test file comparing the two implementations because the link replacer runs on render so it's used in a hot path
  - new implementation has a 3x performance improvement
  - the separate files with the old/new implementations have been cleaned up but I've left the benchmark test file in place for future reference

Benchmark results comparing implementations:

```
❯ node test/benchmark.js

LinkReplacer
├─ cheerio: 5.03K /s ±2.20%
├─ html5parser: 16.5K /s ±0.43%

Completed benchmark in 0.9976526670455933s
┌─────────────┬─────────┬────────────┬─────────┬───────┐
│   (index)   │ percent │ iterations │ current │  max  │
├─────────────┼─────────┼────────────┼─────────┼───────┤
│   cheerio   │   ''    │ '5.03K/s'  │  5037   │ 5037  │
│ html5parser │   ''    │ '16.5K/s'  │  16534  │ 16534 │
└─────────────┴─────────┴────────────┴─────────┴───────┘
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants