html_to_vdom breaks on whitespaces and null HTML tags #777

Archmonger · 2022-06-29T03:57:14Z

Current Situation

If a HTML string has ...

Additional whitespace at the tail of the HTML document
A null (non-terminated) HTML tag such as <hr> within the document

... then html_to_vdom will silently fail without raising an exception

See my Django IDOM PR for how I caused the script tag issue to occur at this LOC.

Proposed Actions

Whitespace issue can be fixed by simple regex and/or using str.strip().

Script tag issue needs further investigation. This might open up a can of worms to test html_to_vdom with all idom.html types.

The text was updated successfully, but these errors were encountered:

Archmonger · 2022-07-01T05:26:12Z

@rmorshea Would like you to chime in on this one too, while you still have availability.

rmorshea · 2022-07-01T06:29:28Z

I'll take a look at this over the weekend.

Archmonger · 2022-08-02T23:16:08Z

The issue I categorized as a "script tag" issue appears to be related to HTML doesn't have end tags (such as <hr>) break the HTMLParser implementation that is currently developed. It was just coincidence that my script tag was following a <hr>. I am looking into a solution.

Archmonger · 2022-08-03T04:55:56Z

Googling and testing shows me there doesn't appear to be a fool proof way of resolving this with python's built-in HtmlParser. It's technically a syntax tree parser and not a DOM tree generator. The combination of handle_starttag, handle_endtag, and handle_startendtag are not robust enough to handle these scenarios without committing some really ugly code.

In the interest of maintainability, I'm going to switch us to using lxml or BeautifulSoup using the lxml backend. A nice benefit of the switch is it technically should be much faster since lxml is C based.

rmorshea · 2022-08-03T20:39:15Z

That target parser interface of lxml seems like it maps nicely onto what we already do. I wasn't able to find something similar in bs4, but I didn't look that hard.

Archmonger · 2022-08-04T05:13:46Z

Using that interface would likely make us fall into the same pitfalls, since it doesn't provide access to the DOM tree. I've been testing out performing these transformations recursively.

Seems to work off my rudimentary test, and is a lot more readable.

from io import StringIO
from pprint import pprint
from typing import Union
from lxml import etree

my_html = """
<div id="view_to_component_middleware">
    view_to_component_middleware: Success
    <div class="inner"> this is text </div>
</div>\n

<hr>
<script>
    var dog = 'memes';
</script>"""


def _set_if_val_exists(object, key, value):
    """Sets a key on a dictionary if the value's length is greater than 0."""
    if len(value):
        object[key] = value


def html_to_vdom(html: Union[str, etree._Element], *transforms):
    """Convert an lxml.etree node tree into a VDOM dict."""
    # If the user provided a string, convert it to an lxml.etree node.
    if isinstance(html, str):
        parser = etree.HTMLParser()
        tree = etree.parse(StringIO(html), parser)
        node = tree.getroot()
    elif isinstance(html, etree._Element):
        node = html
    else:
        raise TypeError("html_to_vdom expects a string or lxml.etree._Element")

    # Convert the lxml.etree node to a VDOM dict.
    vdom = {"tagName": node.tag}
    _set_if_val_exists(vdom, "attributes", dict(node.items()))
    _set_if_val_exists(
        vdom, "children", [html_to_vdom(child) for child in node.iterchildren(None)]
    )

    # Apply any transforms.
    for transform in transforms:
        vdom = transform(vdom)

    return vdom


pprint(html_to_vdom(my_html))

Archmonger added type-bug About something that isn't working priority-2-moderate Should be resolved on a reasonable timeline. labels Jun 29, 2022

Archmonger mentioned this issue Jul 1, 2022

View to Component reactive-python/reactpy-django#81

Merged

3 tasks

Archmonger added this to the 1.0 milestone Jul 1, 2022

rmorshea self-assigned this Jul 1, 2022

Archmonger assigned Archmonger and unassigned rmorshea Jul 10, 2022

Archmonger mentioned this issue Aug 4, 2022

Use LXML for html_to_vdom #795

Merged

3 tasks

Archmonger changed the title ~~html_to_vdom breaks on Script tag and whitespaces~~ html_to_vdom breaks on whitespaces and null HTML tags Aug 5, 2022

Archmonger closed this as completed in #795 Aug 14, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

html_to_vdom breaks on whitespaces and null HTML tags #777

html_to_vdom breaks on whitespaces and null HTML tags #777

Archmonger commented Jun 29, 2022 •

edited

Loading

Archmonger commented Jul 1, 2022

rmorshea commented Jul 1, 2022

Archmonger commented Aug 2, 2022 •

edited

Loading

Archmonger commented Aug 3, 2022 •

edited

Loading

rmorshea commented Aug 3, 2022 •

edited

Loading

Archmonger commented Aug 4, 2022 •

edited

Loading

html_to_vdom breaks on whitespaces and null HTML tags #777

html_to_vdom breaks on whitespaces and null HTML tags #777

Comments

Archmonger commented Jun 29, 2022 • edited Loading

Current Situation

Proposed Actions

Archmonger commented Jul 1, 2022

rmorshea commented Jul 1, 2022

Archmonger commented Aug 2, 2022 • edited Loading

Archmonger commented Aug 3, 2022 • edited Loading

rmorshea commented Aug 3, 2022 • edited Loading

Archmonger commented Aug 4, 2022 • edited Loading

Archmonger commented Jun 29, 2022 •

edited

Loading

Archmonger commented Aug 2, 2022 •

edited

Loading

Archmonger commented Aug 3, 2022 •

edited

Loading

rmorshea commented Aug 3, 2022 •

edited

Loading

Archmonger commented Aug 4, 2022 •

edited

Loading