Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

html_to_vdom breaks on whitespaces and null HTML tags #777

Closed
Archmonger opened this issue Jun 29, 2022 · 6 comments · Fixed by #795
Closed

html_to_vdom breaks on whitespaces and null HTML tags #777

Archmonger opened this issue Jun 29, 2022 · 6 comments · Fixed by #795
Assignees
Labels
priority-2-moderate Should be resolved on a reasonable timeline. type-bug About something that isn't working
Milestone

Comments

@Archmonger
Copy link
Contributor

Archmonger commented Jun 29, 2022

Current Situation

If a HTML string has ...

  • Additional whitespace at the tail of the HTML document
  • A null (non-terminated) HTML tag such as <hr> within the document

... then html_to_vdom will silently fail without raising an exception

See my Django IDOM PR for how I caused the script tag issue to occur at this LOC.

Proposed Actions

Whitespace issue can be fixed by simple regex and/or using str.strip().

Script tag issue needs further investigation. This might open up a can of worms to test html_to_vdom with all idom.html types.

@Archmonger Archmonger added type-bug About something that isn't working priority-2-moderate Should be resolved on a reasonable timeline. labels Jun 29, 2022
@Archmonger Archmonger added this to the 1.0 milestone Jul 1, 2022
@Archmonger
Copy link
Contributor Author

@rmorshea Would like you to chime in on this one too, while you still have availability.

@rmorshea
Copy link
Collaborator

rmorshea commented Jul 1, 2022

I'll take a look at this over the weekend.

@rmorshea rmorshea self-assigned this Jul 1, 2022
@Archmonger Archmonger assigned Archmonger and unassigned rmorshea Jul 10, 2022
@Archmonger
Copy link
Contributor Author

Archmonger commented Aug 2, 2022

The issue I categorized as a "script tag" issue appears to be related to HTML doesn't have end tags (such as <hr>) break the HTMLParser implementation that is currently developed. It was just coincidence that my script tag was following a <hr>. I am looking into a solution.

@Archmonger
Copy link
Contributor Author

Archmonger commented Aug 3, 2022

Googling and testing shows me there doesn't appear to be a fool proof way of resolving this with python's built-in HtmlParser. It's technically a syntax tree parser and not a DOM tree generator. The combination of handle_starttag, handle_endtag, and handle_startendtag are not robust enough to handle these scenarios without committing some really ugly code.

In the interest of maintainability, I'm going to switch us to using lxml or BeautifulSoup using the lxml backend. A nice benefit of the switch is it technically should be much faster since lxml is C based.

@rmorshea
Copy link
Collaborator

rmorshea commented Aug 3, 2022

That target parser interface of lxml seems like it maps nicely onto what we already do. I wasn't able to find something similar in bs4, but I didn't look that hard.

@Archmonger
Copy link
Contributor Author

Archmonger commented Aug 4, 2022

Using that interface would likely make us fall into the same pitfalls, since it doesn't provide access to the DOM tree. I've been testing out performing these transformations recursively.

Seems to work off my rudimentary test, and is a lot more readable.

from io import StringIO
from pprint import pprint
from typing import Union
from lxml import etree

my_html = """
<div id="view_to_component_middleware">
    view_to_component_middleware: Success
    <div class="inner"> this is text </div>
</div>\n

<hr>
<script>
    var dog = 'memes';
</script>"""


def _set_if_val_exists(object, key, value):
    """Sets a key on a dictionary if the value's length is greater than 0."""
    if len(value):
        object[key] = value


def html_to_vdom(html: Union[str, etree._Element], *transforms):
    """Convert an lxml.etree node tree into a VDOM dict."""
    # If the user provided a string, convert it to an lxml.etree node.
    if isinstance(html, str):
        parser = etree.HTMLParser()
        tree = etree.parse(StringIO(html), parser)
        node = tree.getroot()
    elif isinstance(html, etree._Element):
        node = html
    else:
        raise TypeError("html_to_vdom expects a string or lxml.etree._Element")

    # Convert the lxml.etree node to a VDOM dict.
    vdom = {"tagName": node.tag}
    _set_if_val_exists(vdom, "attributes", dict(node.items()))
    _set_if_val_exists(
        vdom, "children", [html_to_vdom(child) for child in node.iterchildren(None)]
    )

    # Apply any transforms.
    for transform in transforms:
        vdom = transform(vdom)

    return vdom


pprint(html_to_vdom(my_html))

@Archmonger Archmonger changed the title html_to_vdom breaks on Script tag and whitespaces html_to_vdom breaks on whitespaces and null HTML tags Aug 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority-2-moderate Should be resolved on a reasonable timeline. type-bug About something that isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants