markdown.extensions.md_in_html fails with TypeError: expected string or bytes-like object #1040

dbader · 2020-10-14T03:07:54Z

I'm running into an issue with "simple" Markdown + HTML examples crashing with a TypeError. For example, the following throws a TypeError: expected string or bytes-like object exception:

markdown.markdown(
    'Hello\n<div markdown="1"></div>', 
    extensions=["markdown.extensions.md_in_html"],
)

Versions: Markdown==3.3.1, CPython 3.8.6

I haven't had time yet to investigate this further, but I wanted to get an issue opened with some failing test cases in case others are running into this as well. I suspect this has something to do with the raw HTML parser rewrite in #803.

Steps to reproduce:

Save the following as mderror.py:

import traceback
import markdown


def try_render(text):
    print(f"\n\n---- Rendering: {text!r}")
    try:
        html = markdown.markdown(
            text, extensions=["markdown.extensions.md_in_html"],
        )
        print(f"---- PASS: {html!r}")
    except Exception as e:
        print("---- FAILED")
        traceback.print_exc()


try_render('Hello<div markdown="1"></div>')  # Error
try_render('Hello\n<div markdown="1"></div>')  # Error
try_render('Hello\n\n<div markdown="1"></div>')  # Okay

Result:

$ python mderror.py


---- Rendering: 'Hello<div markdown="1"></div>'
<Element 'div' at 0x110758310>
---- FAILED
Traceback (most recent call last):
  File "mderror.py", line 8, in try_render
    html = markdown.markdown(
  File "PROJECT_ROOT/venv/lib/python3.8/site-packages/markdown/core.py", line 386, in markdown
    return md.convert(text)
  File "PROJECT_ROOT/venv/lib/python3.8/site-packages/markdown/core.py", line 290, in convert
    output = pp.run(output)
  File "PROJECT_ROOT/venv/lib/python3.8/site-packages/markdown/postprocessors.py", line 73, in run
    if self.isblocklevel(html):
  File "PROJECT_ROOT/venv/lib/python3.8/site-packages/markdown/postprocessors.py", line 94, in isblocklevel
    m = re.match(r"^\<\/?([^ >]+)", html)
  File "/Users/dbader/.pyenv/versions/3.8.6/lib/python3.8/re.py", line 191, in match
    return _compile(pattern, flags).match(string)
TypeError: expected string or bytes-like object


---- Rendering: 'Hello\n<div markdown="1"></div>'
<Element 'div' at 0x110758900>
---- FAILED
Traceback (most recent call last):
  File "mderror.py", line 8, in try_render
    html = markdown.markdown(
  File "PROJECT_ROOT/venv/lib/python3.8/site-packages/markdown/core.py", line 386, in markdown
    return md.convert(text)
  File "PROJECT_ROOT/venv/lib/python3.8/site-packages/markdown/core.py", line 290, in convert
    output = pp.run(output)
  File "PROJECT_ROOT/venv/lib/python3.8/site-packages/markdown/postprocessors.py", line 73, in run
    if self.isblocklevel(html):
  File "PROJECT_ROOT/venv/lib/python3.8/site-packages/markdown/postprocessors.py", line 94, in isblocklevel
    m = re.match(r"^\<\/?([^ >]+)", html)
  File "/Users/dbader/.pyenv/versions/3.8.6/lib/python3.8/re.py", line 191, in match
    return _compile(pattern, flags).match(string)
TypeError: expected string or bytes-like object


---- Rendering: 'Hello\n\n<div markdown="1"></div>'

---- PASS: '<p>Hello</p>\n<div></div>'

The text was updated successfully, but these errors were encountered:

facelessuser · 2020-10-14T03:32:11Z

I imagine this has to do with the new HTML handler. It looks like there may be a corner case where we think we are handing over strings, but etree elements are slipping through.

facelessuser · 2020-10-14T03:33:31Z

Hmm, looks like somehow we are stashing etree objects instead of strings....

facelessuser · 2020-10-14T04:54:38Z

It appears it gets handled by md_in_html in the initial parse, but because it is within a paragraph, it doesn't get handled later by the block processor. We can force it to not fail by always injecting new lines before it to force it to be seen later as a block

diff --git a/markdown/extensions/md_in_html.py b/markdown/extensions/md_in_html.py
index 3518d05..e1fcc7a 100644
--- a/markdown/extensions/md_in_html.py
+++ b/markdown/extensions/md_in_html.py
@@ -127,6 +127,7 @@ class HTMLExtractorExtra(HTMLExtractor):
                 if not self.mdstack:
                     # Last item in stack is closed. Stash it
                     element = self.get_element()
+                    self.cleandoc.append('\n\n')
                     self.cleandoc.append(self.md.htmlStash.store(element))
                     self.cleandoc.append('\n\n')
                     self.state = []

---- Rendering: 'Hello<div markdown="1"></div>'
---- PASS: '<p>Hello</p>\n<div></div>'


---- Rendering: 'Hello\n<div markdown="1"></div>'
---- PASS: '<p>Hello</p>\n<div></div>'


---- Rendering: 'Hello\n\n<div markdown="1"></div>'
---- PASS: '<p>Hello</p>\n<div></div>'

Now this causes a different behavior than when markdown="1" is not included as then md_in_html doesn't process it and the normal HTML logic takes over.

---- Rendering: 'Hello<div></div>'
---- PASS: '<p>Hello<div></div></p>'


---- Rendering: 'Hello\n<div></div>'
---- PASS: '<p>Hello</p>\n<div></div>'


---- Rendering: 'Hello\n\n<div></div>'
---- PASS: '<p>Hello</p>\n<div></div>'

Is this discrepancy a problem? That is probably a question for @waylan.

I guess if we wanted to keep the handling the same as when markdown="1" is not included, we need an inline handler to catch these improper md_in_html cases and just convert them to strings and strip the markdown="1". Or we need to somehow identify that this case should immediately get stashed as a string instead of an element.

I'll let @waylan weigh in. I think I understand the problem well enough, but the direction needs to be outlined before I could move forward on this.

By the way @waylan, this was much easier to debug than the old monstrosity. So great work on the refactor 😄.

facelessuser · 2020-10-14T13:29:26Z

Just hacking around a bit. When we see a raw HTML block, we could inspect whether there are newlines before it. Then we could force a similar behavior as what we have by default. This would prevent inline block HTML from being handled as block as it is specified incorrectly.

diff --git a/markdown/extensions/md_in_html.py b/markdown/extensions/md_in_html.py
index 3518d05..f865f78 100644
--- a/markdown/extensions/md_in_html.py
+++ b/markdown/extensions/md_in_html.py
@@ -127,6 +127,22 @@ class HTMLExtractorExtra(HTMLExtractor):
                 if not self.mdstack:
                     # Last item in stack is closed. Stash it
                     element = self.get_element()
+                    # If cleandoc is empty, assume first element
+                    item = self.cleandoc[-1] if self.cleandoc else '\n'
+                    # Got an element, assume there are no newlines prior to this.
+                    if not isinstance(item, str):
+                        item = ''
+                    # Ensure we have at least two newlines if we find one
+                    # If not, we must have an element that is block, but inline,
+                    # and if so, we should convert it to text as it won't get
+                    # processed as a block
+                    if not item.endswith('\n\n'):
+                        if item.endswith('\n'):
+                            self.cleandoc.append('\n')
+                        else:
+                            if "markdown" in element.attrib:
+                                del element.attrib["markdown"]
+                            element = etree.tostring(element, encoding='utf-8', method='html').decode('utf-8')
                     self.cleandoc.append(self.md.htmlStash.store(element))
                     self.cleandoc.append('\n\n')
                     self.state = []

Alternatively, we could always force blocks to be treated as blocks even if they are inline by forcing new lines before them, but then I think we'd need to do that outside of md_in_html too for consistency.

Anyways, hopefully, this provides enough information for us to make a decision on how to handle this. I'm happy to push something through pull request once a decision of how we want to handle this is made.

facelessuser · 2020-10-14T13:32:29Z

Personally, I find forcing the newlines before the block to produce more sane results, but this could be considered undefined behavior if Markdown rules are not being followed 🤷 .

waylan · 2020-10-14T13:52:51Z

I haven't read through this entire report and @facelessuser's analysis yet. However, I will address the two failing cases in the initial report.

Hello<div markdown="1"></div>

That is not valid Markdown. A block-level element must start at the beginning of a line. Technically, the markdown="1" attribute should be completely ignored in this case (just as it would be ignored for <span markdown="1">). The fact that we are getting the error suggests that that is not happening.
Hello\n<div markdown="1"></div>

As discussed in Unexpected (?) behavior change in parsing newlines #1041, this should have the same behavior as the case with two newlines.

Presumably the logic in the core needs to be replicated in the extension.

Regardless, in either case, a crashing bug is bad. We should probably include something in the postprocessor to handle any Element objects that slip through the cracks. At a minimum, it could fail silently.

facelessuser · 2020-10-14T13:58:11Z

Sounds good.

waylan · 2020-10-14T14:05:22Z

By the way @waylan, this was much easier to debug than the old monstrosity. So great work on the refactor 😄.

Thank you. 😊

waylan · 2020-10-14T16:07:57Z

Personally, I find forcing the newlines before the block to produce more sane results,

Yes, this is how its being done in the core. Every time we determine that a new raw HTML block is started, we insert one blank line before the placeholder (here). We don't even check if a blank line preceded the start tag (we only check that this is a new block and that the start tag is at the start of a line). In some cases we end up with extra blank lines, which is fine because Markdown ignores them. It appears that the blank line is not being inserted consistently in the extension.

We do have to be careful of one thing: that the blank lines are only inserted in Markdown content. Never within raw HTML tags which don't get parsed as Markdown. Doing that would alter the raw HTML in ways we promise not to.

By calling str on all stash elements we ensure they don't raise an error. Worse case, soemthing like `<Element 'div' at 0x000001B2DAE94900>` gets inserted into the output. However, with the override in the md_in_html extension, we actually serialize and reinsert the original HTML. Worse case, an HTML block which should be parsed as Markdown gets skipped by the extension (`<div markdown="block"></div>` gets inserting into the output). The tricky part is testing as there should be no known cases where this ever occurs. Therefore, we forefully pass an etree Element directly to the method in the test. That said, as Python-Markdown#1040 is unresolved at this point, I have tested locally with a real existing case and it works well. Related to Python-Markdown#1040.

By calling str on all stash elements we ensure they don't raise an error. Worse case, soemthing like `<Element 'div' at 0x000001B2DAE94900>` gets inserted into the output. However, with the override in the md_in_html extension, we actually serialize and reinsert the original HTML. Worse case, an HTML block which should be parsed as Markdown gets skipped by the extension (`<div markdown="block"></div>` gets inserting into the output). The tricky part is testing as there should be no known cases where this ever occurs. Therefore, we forefully pass an etree Element directly to the method in the test. That said, as #1040 is unresolved at this point, I have tested locally with a real existing case and it works well. Related to #1040.

waylan · 2020-10-14T19:27:24Z

We should probably include something in the postprocessor to handle any Element objects that slip through the cracks.

That has been addressed in #1042, which means no more crashing errors. We still need to address the incorrect parsing in the extension.

waylan · 2020-10-14T20:25:46Z

I pushed some tests to #1043 and a fix for the simplest one. I ran out of time on the other two failing tests. At least the desired behavior is defined. Contributions are welcome.

facelessuser · 2020-10-14T21:56:59Z

Cool. I should be able to take a crack at these.

facelessuser · 2020-10-15T01:37:01Z

I've resolved all the tests.

facelessuser · 2020-10-15T01:44:46Z

Fixes are at #1044.

@waylan I am a bit new to the new HTMLParser, so if I'm doing anything sub-optimal, let me know. I know this stuff is fresh in your mind. But the current changes seem to at least not break existing tests while satisfying your new tests.

Fixes #1040 and fixes #1045.

waylan added bug Bug report. extension Related to one or more of the included extensions. labels Oct 14, 2020

waylan mentioned this issue Oct 14, 2020

Account for Etree Elements in HTML Stash #1042

Merged

waylan mentioned this issue Oct 14, 2020

Bring md_in_html behavior inline with base. #1043

Closed

facelessuser mentioned this issue Oct 15, 2020

Completes pull #1043 which fixes #1040 #1044

Merged

waylan closed this as completed in #1044 Oct 19, 2020

waylan pushed a commit that referenced this issue Oct 19, 2020

Properly parse inline HTML in md_in_html

2766698

Fixes #1040 and fixes #1045.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

markdown.extensions.md_in_html fails with TypeError: expected string or bytes-like object #1040

markdown.extensions.md_in_html fails with TypeError: expected string or bytes-like object #1040

dbader commented Oct 14, 2020

facelessuser commented Oct 14, 2020

facelessuser commented Oct 14, 2020

facelessuser commented Oct 14, 2020

facelessuser commented Oct 14, 2020

facelessuser commented Oct 14, 2020

waylan commented Oct 14, 2020 •

edited

Loading

facelessuser commented Oct 14, 2020

waylan commented Oct 14, 2020 •

edited

Loading

waylan commented Oct 14, 2020 •

edited

Loading

waylan commented Oct 14, 2020

waylan commented Oct 14, 2020

facelessuser commented Oct 14, 2020

facelessuser commented Oct 15, 2020

facelessuser commented Oct 15, 2020

markdown.extensions.md_in_html fails with TypeError: expected string or bytes-like object #1040

markdown.extensions.md_in_html fails with TypeError: expected string or bytes-like object #1040

Comments

dbader commented Oct 14, 2020

Steps to reproduce:

facelessuser commented Oct 14, 2020

facelessuser commented Oct 14, 2020

facelessuser commented Oct 14, 2020

facelessuser commented Oct 14, 2020

facelessuser commented Oct 14, 2020

waylan commented Oct 14, 2020 • edited Loading

facelessuser commented Oct 14, 2020

waylan commented Oct 14, 2020 • edited Loading

waylan commented Oct 14, 2020 • edited Loading

waylan commented Oct 14, 2020

waylan commented Oct 14, 2020

facelessuser commented Oct 14, 2020

facelessuser commented Oct 15, 2020

facelessuser commented Oct 15, 2020

waylan commented Oct 14, 2020 •

edited

Loading

waylan commented Oct 14, 2020 •

edited

Loading

waylan commented Oct 14, 2020 •

edited

Loading