
HTML in JavaScript leads to undecoded character references in URLs #460

Open
JustAnotherArchivist opened this issue Mar 15, 2021 · 0 comments


When wpull encounters HTML inside JavaScript strings (or a JSON API), it does not decode character references in the extracted URLs, because it does not treat HTML in JS strings specially at all. As a result, `&amp;` frequently appears in URLs. Further, if a numeric character reference (`&#nnn;`) is involved, part of the URL is dropped entirely on parsing, since everything after the hash is treated as the fragment (seen in ArchiveBot job 51nt0cax16fen2l8kv14kraon).
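To illustrate the fragment problem, here is a minimal sketch using the standard library's `urllib.parse.urlsplit`. The URL is a made-up example, and wpull's own URL parsing may differ in detail; this only shows the general effect of an undecoded numeric character reference.

```python
from urllib.parse import urlsplit

# Hypothetical URL as it might be extracted from HTML embedded in a JS string.
# "&amp;" and "&#38;" both stand for "&" but have not been decoded.
raw = "http://example.com/page?a=1&amp;b=2&#38;c=3"

parts = urlsplit(raw)

# The "#" of "&#38;" is taken as the start of the fragment,
# so "c=3" never reaches the query string.
print(parts.query)     # a=1&amp;b=2&
print(parts.fragment)  # 38;c=3
```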

I'm not sure what the best strategy here is. Trying to detect whether a JS string contains HTML is probably expensive and may not be worth it. Attempting to decode char refs in JS-extracted URLs may be worth exploring though.
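A rough sketch of the second idea, decoding character references in JS-extracted URLs, could look like the following. `decode_char_refs` is a hypothetical helper, not an existing wpull function, and the URL is invented for illustration.

```python
import html
from urllib.parse import urlsplit


def decode_char_refs(url: str) -> str:
    """Hypothetical helper: decode HTML character references in a URL
    that was extracted from a JavaScript string."""
    return html.unescape(url)


# Made-up example URL containing one named and one numeric reference.
raw = "http://example.com/page?a=1&amp;b=2&#38;c=3"
fixed = decode_char_refs(raw)

print(fixed)                     # http://example.com/page?a=1&b=2&c=3
print(urlsplit(fixed).fragment)  # empty: nothing is mistaken for a fragment
```

One caveat with this approach: `html.unescape` follows HTML5 rules and also decodes some legacy named references that lack a trailing semicolon, so a legitimate query parameter such as `&copy=1` in a URL that was never HTML-encoded could be altered. Restricting decoding to semicolon-terminated references might reduce that risk.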
