User documentation? #489

davaya · 2020-05-06T20:47:52Z

Is there any tutorial documentation for this package? Something that would answer questions like the following, which came up when parsing an HTML file:

import html5lib

with open(fname, 'rb') as f:
    doc = html5lib.parse(f)
for e in doc[0]:  # doc[0] is the <head> element
    print(e.tag)

Why does the result look like this?

{http://www.w3.org/1999/xhtml}meta
{http://www.w3.org/1999/xhtml}title
{http://www.w3.org/1999/xhtml}link

A naive user would expect tag to return the literal value contained in the HTML element, not the tag prefixed with a qualifier of some sort. It would be helpful to have a document that explains why the prefix has been injected and how to configure the library to return unadorned tags.

gsnedders · 2020-05-06T22:16:45Z

https://html5lib.readthedocs.io/en/latest/ has docs, though it doesn't answer the why. https://html5lib.readthedocs.io/en/latest/html5lib.html#html5lib.html5parser.HTMLParser specifically mentions the namespaceHTMLElements arg.

The why is quite simple: it's what the HTML spec says (almost all elements created when parsing HTML are placed in the HTML namespace); you can see browsers do this via document.documentElement.namespaceURI on this page.
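For readers who keep the default namespacing, here is a minimal sketch of how the `{uri}tag` form (ElementTree's Clark notation) can be handled with the standard library alone. The `split_tag` helper and the hand-written XHTML string are illustrative assumptions, standing in for a tree that html5lib would return with its defaults.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def split_tag(tag):
    """Split an ElementTree tag in '{uri}local' form into (uri, local).

    Hypothetical helper for illustration; not part of html5lib.
    """
    if tag.startswith("{"):
        uri, _, local = tag[1:].partition("}")
        return uri, local
    return None, tag


# Stand-in for a namespaced tree such as html5lib produces by default.
root = ET.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><meta charset='utf-8'/><title>Demo</title></head><body/></html>"
)

# Recover the unadorned tag names of the <head> children.
head = root[0]
local_names = [split_tag(child.tag)[1] for child in head]
print(local_names)

# Namespace-aware search: map a prefix of your choice to the XHTML namespace.
ns = {"h": XHTML_NS}
title = root.find(".//h:title", ns)
print(title.text)
```

The prefix `h` is arbitrary; only the URI it maps to matters when ElementTree resolves the path.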

davaya · 2020-05-06T22:46:22Z

Thanks - creating a parser object explicitly with the proper args gives the desired results. New users reading the overview might benefit from some rationale (pros and cons) for choosing a particular tree type, and a note pointing out that namespacing is one of the options that can be specified.
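As a rough sketch of the difference described here, the snippet below parses the same markup twice with html5lib's default etree tree builder: once with default settings and once with `namespaceHTMLElements=False`. The `markup` string is made up for illustration, and html5lib is assumed to be installed.

```python
import html5lib

# Illustrative input; html5lib adds the missing <html>, <head> and <body>.
markup = "<title>Demo</title><p>Hello</p>"

# Default behaviour: element tags carry the XHTML namespace.
namespaced = html5lib.parse(markup)

# Opting out of namespacing yields plain tag names.
plain = html5lib.parse(markup, namespaceHTMLElements=False)

print(namespaced.tag)
print(plain.tag)

# With plain names, simple ElementTree paths work without a namespace map.
head_tags = [child.tag for child in plain[0]]
paragraph = plain.find(".//p")
print(head_tags, paragraph.text)
```

The same keyword can be passed to `html5lib.HTMLParser(...)` when a reusable parser object is preferred.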

It could be argued that new users might not be aware of namespacing and should not get it by default, while those who do need it would know enough to opt in.

guettli · 2021-07-09T08:27:13Z

I think a basic usage example would be helpful.

Example: parse HTML, replace the innerHTML of all <a> links with "super foo", and then write it out again.

So far, the docs only cover parsing.

Please add an example of how to process the parsed data.

guettli · 2021-07-09T08:59:43Z

Something like this would be nice to have in the docs:

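from xml.etree import ElementTree
from html5lib import HTMLParser

# Disable namespacing so tags are plain names such as 'h1'
parser = HTMLParser(namespaceHTMLElements=False)

tree = parser.parse('''
  foo
  <h1>Moonlight</h1>
  bar''')

# Replace the text of every <h1> element
for e in tree.findall('.//h1'):
    e.text = 'Sunshine'

# Serialize the modified tree
print(ElementTree.tostring(tree))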

Source: https://stackoverflow.com/questions/68313619/how-to-replace-the-innerhtml-of-all-h1-tags-with-html5lib
