When wpull encounters HTML inside JavaScript strings (or a JSON API response), it does not decode character references in the extracted URLs, because it does not treat HTML in JS strings specially at all. As a result, `&amp;` frequently appears in URLs. Further, if a numeric character reference (`&#nnn;`) is involved, part of the URL is dropped entirely on parsing, because everything after the hash is treated as the fragment (seen in ArchiveBot job 51nt0cax16fen2l8kv14kraon).
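To illustrate the fragment problem, here is a minimal sketch using only the standard library's `urllib.parse.urlsplit`, not wpull's own parsing code. The URL is a made-up example, not one taken from the job mentioned above.

```python
from urllib.parse import urlsplit

# Hypothetical URL as it might be extracted from HTML embedded in a JS string.
# "&#38;" is the numeric character reference for "&".
raw_url = "http://example.com/page?a=1&#38;b=2"

parts = urlsplit(raw_url)

# The "#" inside the undecoded reference is taken as the fragment delimiter,
# so the query is cut short and the rest ends up in the fragment.
print(parts.query)     # a=1&
print(parts.fragment)  # 38;b=2
```

Since fragments are not sent to the server, a crawler that requests this URL loses the `b=2` parameter.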
I'm not sure what the best strategy here is. Trying to detect whether a JS string contains HTML is probably expensive and may not be worth it. Attempting to decode char refs in JS-extracted URLs may be worth exploring though.
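As a rough sketch of the second idea, the snippet below decodes character references in an already-extracted URL. `decode_char_refs` is a hypothetical helper, not part of wpull. It assumes that only references terminated by `;` should be decoded, because calling `html.unescape` on the whole string would also rewrite legacy unterminated names (for example, `&copy=2` would become `©=2`).

```python
import html
import re

# Match only character references that end with ";" (decimal, hex, or named).
# This is an assumption made for safety: query parameters such as "&copy=2"
# are left alone because they lack the terminating semicolon.
_CHAR_REF = re.compile(r"&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def decode_char_refs(url: str) -> str:
    """Decode semicolon-terminated HTML character references in a URL."""
    # html.unescape leaves unknown names (e.g. "&foo;") unchanged.
    return _CHAR_REF.sub(lambda match: html.unescape(match.group(0)), url)


print(decode_char_refs("http://example.com/page?a=1&#38;b=2"))
print(decode_char_refs("http://example.com/page?a=1&amp;b=2"))
print(decode_char_refs("http://example.com/page?a=1&copy=2"))
```

The first two calls yield `http://example.com/page?a=1&b=2`, and the third returns its input unchanged. This approach can still mis-decode a URL that legitimately contains something like `&amp;` in a JS string with no HTML involved, so it would need testing against real crawls before being enabled by default.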