NUL byte in <link> href confuses libxml2-lxml parser #459

Open
JustAnotherArchivist opened this issue Feb 27, 2021 · 0 comments

Via ArchiveBot job b4cobsdfap6j2kzjo3i4jwnsx:

wpull --recursive --no-verbose --no-parent --html-parser libxml2-lxml https://www.e-gov.am/gov-decrees/item/23174/

This recurses to wonderful URLs such as https://www.e-gov.am/gov-decrees/item/23174/1clip_themedata.thmx%22%20rel=%22themeData%22%20/%3E (and it only gets worse from there).
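Percent-decoding the last path segment of that URL shows what was actually picked up as a relative link. A minimal sketch using only the standard library (the variable names are illustrative, not taken from wpull):

```python
from urllib.parse import unquote, urlsplit

url = (
    "https://www.e-gov.am/gov-decrees/item/23174/"
    "1clip_themedata.thmx%22%20rel=%22themeData%22%20/%3E"
)

# Percent-decode the path, then keep only the part after the page's own path,
# i.e. the string that was resolved as a relative link.
decoded_path = unquote(urlsplit(url).path)
extracted = decoded_path.split("/23174/", 1)[1]

print(extracted)  # 1clip_themedata.thmx" rel="themeData" />
```

The decoded remainder is the tail of one of the `<link>` tags quoted below, starting right after the NUL byte. That is consistent with the attribute value having been cut at the NUL, although the exact mechanism inside the parser is not confirmed here.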

The page contains these three <link> tags with NUL bytes (^@):

<link href="file:///C:DOCUME~1MarineALOCALS~1Tempmsohtmlclip1^@1clip_filelist.xml" rel="File-List" />
<link href="file:///C:DOCUME~1MarineALOCALS~1Tempmsohtmlclip1^@1clip_themedata.thmx" rel="themeData" />
<link href="file:///C:DOCUME~1MarineALOCALS~1Tempmsohtmlclip1^@1clip_colorschememapping.xml" rel="colorSchemeMapping" />
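One way to spot such attributes before handing a page to a parser is to scan the raw bytes. The following is a minimal sketch, not wpull code: `find_nul_hrefs` and `strip_nul_bytes` are hypothetical helper names, the regex only covers double-quoted `href` values, and whether stripping NUL bytes before parsing is a suitable workaround is an assumption that has not been tested against libxml2-lxml here.

```python
import re

# Double-quoted href values inside <link> tags that contain at least one NUL byte.
NUL_HREF_RE = re.compile(rb'<link\b[^>]*?\bhref="([^"]*\x00[^"]*)"', re.IGNORECASE)


def find_nul_hrefs(html: bytes) -> list:
    """Return the raw href values of <link> tags that contain a NUL byte."""
    return NUL_HREF_RE.findall(html)


def strip_nul_bytes(html: bytes) -> bytes:
    """Drop every NUL byte from the document (hypothetical pre-parse cleanup)."""
    return html.replace(b"\x00", b"")


# Sample modelled on the first tag above, plus an ordinary <link> for contrast.
sample = (
    b'<link href="file:///C:DOCUME~1MarineALOCALS~1Tempmsohtmlclip1\x00'
    b'1clip_filelist.xml" rel="File-List" />\n'
    b'<link href="style.css" rel="stylesheet" />\n'
)

for href in find_nul_hrefs(sample):
    print(repr(href))
```

Only the first tag in `sample` is reported; after `strip_nul_bytes(sample)` the scan finds nothing.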

This only happens with the libxml2-lxml parser; the html5lib parser handles it correctly, i.e. does not extract any extra URLs.

Tested on two machines, both with Python 3.6.10. One has lxml 4.4.2 and libxml2 2.9.4 with wpull 2.0.3; the other has lxml 4.6.2 and libxml2 2.9.10 with wpull from The Blocking PR (#393).
