HTML in JavaScript leads to undecoded character references in URLs #460

JustAnotherArchivist · 2021-03-15T02:13:23Z

When wpull encounters HTML inside JavaScript strings (or in a JSON API response), it does not decode character references on extracted URLs because it does not treat HTML in JS strings specially at all. This causes frequent &amp; appearances in URLs. Further, if a numeric character reference (&#nnn;) is involved, part of the URL is dropped entirely on parsing, as everything after the hash is treated as the fragment (seen in ArchiveBot job 51nt0cax16fen2l8kv14kraon).
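To illustrate both failure modes, here is a small sketch using Python's standard `urllib.parse.urlsplit` rather than wpull's own URL parsing, which may differ in detail. The example URLs are made up.

```python
from urllib.parse import urlsplit

# Hypothetical URLs as they might appear inside HTML embedded in a JS string.
named = "http://example.com/page?a=1&amp;b=2"
numeric = "http://example.com/page?a=1&#38;b=2"

named_parts = urlsplit(named)
numeric_parts = urlsplit(numeric)

# Named reference: the literal "amp;" stays in the query string.
print(named_parts.query)       # a=1&amp;b=2

# Numeric reference: the "#" starts the fragment, so "b=2" is no longer
# part of the query and would not be sent to the server.
print(numeric_parts.query)     # a=1&
print(numeric_parts.fragment)  # 38;b=2
```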

I'm not sure what the best strategy here is. Trying to detect whether a JS string contains HTML is probably expensive and may not be worth it. Attempting to decode char refs in JS-extracted URLs may be worth exploring though.
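A minimal sketch of the second idea, assuming decoding happens on the raw extracted string before it is parsed as a URL. The function name `decode_char_refs` is hypothetical and not part of wpull. It only decodes references that end in a semicolon, because plain `html.unescape` also accepts some legacy named references without one (for example `&copy`), which would corrupt a legitimate query parameter such as `&copy=2`.

```python
import html
import re

# Matches only semicolon-terminated references: &name; &#123; &#x1F;
_CHAR_REF = re.compile(r"&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def decode_char_refs(url: str) -> str:
    """Decode semicolon-terminated HTML character references in a URL string.

    Unknown names (e.g. "&foo;") are left unchanged by html.unescape.
    """
    return _CHAR_REF.sub(lambda m: html.unescape(m.group(0)), url)


print(decode_char_refs("http://example.com/page?a=1&amp;b=2"))
print(decode_char_refs("http://example.com/page?a=1&#38;b=2"))
```

This would still mis-decode a URL that genuinely contains something like `&lt;` followed by a semicolon in its query string, so it is a heuristic and may need to be limited to URLs extracted from JS.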

JustAnotherArchivist added the enhancement label Mar 15, 2021
