Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

markdown.extensions.md_in_html fails with TypeError: expected string or bytes-like object #1040

Closed
dbader opened this issue Oct 14, 2020 · 14 comments · Fixed by #1044
Closed
Labels
bug Bug report. extension Related to one or more of the included extensions.

Comments

@dbader
Copy link
Contributor

dbader commented Oct 14, 2020

I'm running into an issue with "simple" Markdown + HTML examples crashing with a TypeError. For example, the following throws a TypeError: expected string or bytes-like object exception:

markdown.markdown(
    'Hello\n<div markdown="1"></div>', 
    extensions=["markdown.extensions.md_in_html"],
)

Versions: Markdown==3.3.1, CPython 3.8.6

I haven't had time yet to investigate this further, but I wanted to get an issue opened with some failing test cases in case others are running into this as well. I suspect this has something to do with the raw HTML parser rewrite in #803.

Steps to reproduce:

Save the following as mderror.py:

import traceback
import markdown


def try_render(text):
    print(f"\n\n---- Rendering: {text!r}")
    try:
        html = markdown.markdown(
            text, extensions=["markdown.extensions.md_in_html"],
        )
        print(f"---- PASS: {html!r}")
    except Exception as e:
        print("---- FAILED")
        traceback.print_exc()


try_render('Hello<div markdown="1"></div>')  # Error
try_render('Hello\n<div markdown="1"></div>')  # Error
try_render('Hello\n\n<div markdown="1"></div>')  # Okay

Result:

$ python mderror.py


---- Rendering: 'Hello<div markdown="1"></div>'
<Element 'div' at 0x110758310>
---- FAILED
Traceback (most recent call last):
  File "mderror.py", line 8, in try_render
    html = markdown.markdown(
  File "PROJECT_ROOT/venv/lib/python3.8/site-packages/markdown/core.py", line 386, in markdown
    return md.convert(text)
  File "PROJECT_ROOT/venv/lib/python3.8/site-packages/markdown/core.py", line 290, in convert
    output = pp.run(output)
  File "PROJECT_ROOT/venv/lib/python3.8/site-packages/markdown/postprocessors.py", line 73, in run
    if self.isblocklevel(html):
  File "PROJECT_ROOT/venv/lib/python3.8/site-packages/markdown/postprocessors.py", line 94, in isblocklevel
    m = re.match(r"^\<\/?([^ >]+)", html)
  File "/Users/dbader/.pyenv/versions/3.8.6/lib/python3.8/re.py", line 191, in match
    return _compile(pattern, flags).match(string)
TypeError: expected string or bytes-like object


---- Rendering: 'Hello\n<div markdown="1"></div>'
<Element 'div' at 0x110758900>
---- FAILED
Traceback (most recent call last):
  File "mderror.py", line 8, in try_render
    html = markdown.markdown(
  File "PROJECT_ROOT/venv/lib/python3.8/site-packages/markdown/core.py", line 386, in markdown
    return md.convert(text)
  File "PROJECT_ROOT/venv/lib/python3.8/site-packages/markdown/core.py", line 290, in convert
    output = pp.run(output)
  File "PROJECT_ROOT/venv/lib/python3.8/site-packages/markdown/postprocessors.py", line 73, in run
    if self.isblocklevel(html):
  File "PROJECT_ROOT/venv/lib/python3.8/site-packages/markdown/postprocessors.py", line 94, in isblocklevel
    m = re.match(r"^\<\/?([^ >]+)", html)
  File "/Users/dbader/.pyenv/versions/3.8.6/lib/python3.8/re.py", line 191, in match
    return _compile(pattern, flags).match(string)
TypeError: expected string or bytes-like object


---- Rendering: 'Hello\n\n<div markdown="1"></div>'

---- PASS: '<p>Hello</p>\n<div></div>'
@facelessuser
Copy link
Collaborator

I imagine this has to do with the new HTML handler. It looks like there may be a corner case where we think we are handing over strings, but etree elements are slipping through.

@facelessuser
Copy link
Collaborator

Hmm, looks like somehow we are stashing etree objects instead of strings....

@facelessuser
Copy link
Collaborator

It appears it gets handled by md_in_html in the initial parse, but because it is within a paragraph, it doesn't get handled later by the block processor. We can force it to not fail by always injecting new lines before it to force it to be seen later as a block

diff --git a/markdown/extensions/md_in_html.py b/markdown/extensions/md_in_html.py
index 3518d05..e1fcc7a 100644
--- a/markdown/extensions/md_in_html.py
+++ b/markdown/extensions/md_in_html.py
@@ -127,6 +127,7 @@ class HTMLExtractorExtra(HTMLExtractor):
                 if not self.mdstack:
                     # Last item in stack is closed. Stash it
                     element = self.get_element()
+                    self.cleandoc.append('\n\n')
                     self.cleandoc.append(self.md.htmlStash.store(element))
                     self.cleandoc.append('\n\n')
                     self.state = []
---- Rendering: 'Hello<div markdown="1"></div>'
---- PASS: '<p>Hello</p>\n<div></div>'


---- Rendering: 'Hello\n<div markdown="1"></div>'
---- PASS: '<p>Hello</p>\n<div></div>'


---- Rendering: 'Hello\n\n<div markdown="1"></div>'
---- PASS: '<p>Hello</p>\n<div></div>'

Now this causes a different behavior than when markdown="1" is not included as then md_in_html doesn't process it and the normal HTML logic takes over.

---- Rendering: 'Hello<div></div>'
---- PASS: '<p>Hello<div></div></p>'


---- Rendering: 'Hello\n<div></div>'
---- PASS: '<p>Hello</p>\n<div></div>'


---- Rendering: 'Hello\n\n<div></div>'
---- PASS: '<p>Hello</p>\n<div></div>'

Is this discrepancy a problem? That is probably a question for @waylan.

I guess if we wanted to keep the handling the same as when markdown="1" is not included, we need an inline handler to catch these improper md_in_html cases and just convert them to strings and strip the markdown="1". Or we need to somehow identify that this case should immediately get stashed as a string instead of an element.

I'll let @waylan weigh in. I think I understand the problem well enough, but the direction needs to be outlined before I could move forward on this.

By the way @waylan, this was much easier to debug than the old monstrosity. So great work on the refactor 😄.

@facelessuser
Copy link
Collaborator

Just hacking around a bit. When we see a raw HTML block, we could inspect whether there are newlines before it. Then we could force a similar behavior as what we have by default. This would prevent inline block HTML from being handled as block as it is specified incorrectly.

diff --git a/markdown/extensions/md_in_html.py b/markdown/extensions/md_in_html.py
index 3518d05..f865f78 100644
--- a/markdown/extensions/md_in_html.py
+++ b/markdown/extensions/md_in_html.py
@@ -127,6 +127,22 @@ class HTMLExtractorExtra(HTMLExtractor):
                 if not self.mdstack:
                     # Last item in stack is closed. Stash it
                     element = self.get_element()
+                    # If cleandoc is empty, assume first element
+                    item = self.cleandoc[-1] if self.cleandoc else '\n'
+                    # Got an element, assume there are no newlines prior to this.
+                    if not isinstance(item, str):
+                        item = ''
+                    # Ensure we have at least two newlines if we find one
+                    # If not, we must have an element that is block, but inline,
+                    # and if so, we should convert it to text as it won't get
+                    # processed as a block
+                    if not item.endswith('\n\n'):
+                        if item.endswith('\n'):
+                            self.cleandoc.append('\n')
+                        else:
+                            if "markdown" in element.attrib:
+                                del element.attrib["markdown"]
+                            element = etree.tostring(element, encoding='utf-8', method='html').decode('utf-8')
                     self.cleandoc.append(self.md.htmlStash.store(element))
                     self.cleandoc.append('\n\n')
                     self.state = []

Alternatively, we could always force blocks to be treated as blocks even if they are inline by forcing new lines before them, but then I think we'd need to do that outside of md_in_html too for consistency.

Anyways, hopefully, this provides enough information for us to make a decision on how to handle this. I'm happy to push something through pull request once a decision of how we want to handle this is made.

@facelessuser
Copy link
Collaborator

Personally, I find forcing the newlines before the block to produce more sane results, but this could be considered undefined behavior if Markdown rules are not being followed 🤷 .

@waylan
Copy link
Member

waylan commented Oct 14, 2020

I haven't read through this entire report and @facelessuser's analysis yet. However, I will address the two failing cases in the initial report.

  • Hello<div markdown="1"></div>

    That is not valid Markdown. A block-level element must start at the beginning of a line. Technically, the markdown="1" attribute should be completely ignored in this case (just as it would be ignored for <span markdown="1">). The fact that we are getting the error suggests that that is not happening.

  • Hello\n<div markdown="1"></div>

    As discussed in Unexpected (?) behavior change in parsing newlines #1041, this should have the same behavior as the case with two newlines.

Presumably the logic in the core needs to be replicated in the extension.

Regardless, in either case, a crashing bug is bad. We should probably include something in the postprocessor to handle any Element objects that slip through the cracks. At a minimum, it could fail silently.

@facelessuser
Copy link
Collaborator

Sounds good.

@waylan waylan added bug Bug report. extension Related to one or more of the included extensions. labels Oct 14, 2020
@waylan
Copy link
Member

waylan commented Oct 14, 2020

By the way @waylan, this was much easier to debug than the old monstrosity. So great work on the refactor 😄.

Thank you. 😊

@waylan
Copy link
Member

waylan commented Oct 14, 2020

Personally, I find forcing the newlines before the block to produce more sane results,

Yes, this is how its being done in the core. Every time we determine that a new raw HTML block is started, we insert one blank line before the placeholder (here). We don't even check if a blank line preceded the start tag (we only check that this is a new block and that the start tag is at the start of a line). In some cases we end up with extra blank lines, which is fine because Markdown ignores them. It appears that the blank line is not being inserted consistently in the extension.

We do have to be careful of one thing: that the blank lines are only inserted in Markdown content. Never within raw HTML tags which don't get parsed as Markdown. Doing that would alter the raw HTML in ways we promise not to.

waylan added a commit to waylan/markdown that referenced this issue Oct 14, 2020
By calling str on all stash elements we ensure they don't raise an error.
Worse case, soemthing like `<Element 'div' at 0x000001B2DAE94900>` gets
inserted into the output. However, with the override in the md_in_html
extension, we actually serialize and reinsert the original HTML. Worse case,
an HTML block which should be parsed as Markdown gets skipped by the
extension (`<div markdown="block"></div>` gets inserting into the output).

The tricky part is testing as there should be no known cases where this
ever occurs. Therefore, we forefully pass an etree Element directly to
the method in the test. That said, as Python-Markdown#1040 is unresolved at this point,
I have tested locally with a real existing case and it works well.

Related to Python-Markdown#1040.
waylan added a commit that referenced this issue Oct 14, 2020
By calling str on all stash elements we ensure they don't raise an error.
Worse case, soemthing like `<Element 'div' at 0x000001B2DAE94900>` gets
inserted into the output. However, with the override in the md_in_html
extension, we actually serialize and reinsert the original HTML. Worse case,
an HTML block which should be parsed as Markdown gets skipped by the
extension (`<div markdown="block"></div>` gets inserting into the output).

The tricky part is testing as there should be no known cases where this
ever occurs. Therefore, we forefully pass an etree Element directly to
the method in the test. That said, as #1040 is unresolved at this point,
I have tested locally with a real existing case and it works well.

Related to #1040.
@waylan
Copy link
Member

waylan commented Oct 14, 2020

We should probably include something in the postprocessor to handle any Element objects that slip through the cracks.

That has been addressed in #1042, which means no more crashing errors. We still need to address the incorrect parsing in the extension.

@waylan
Copy link
Member

waylan commented Oct 14, 2020

I pushed some tests to #1043 and a fix for the simplest one. I ran out of time on the other two failing tests. At least the desired behavior is defined. Contributions are welcome.

@facelessuser
Copy link
Collaborator

Cool. I should be able to take a crack at these.

@facelessuser
Copy link
Collaborator

I've resolved all the tests.

@facelessuser
Copy link
Collaborator

Fixes are at #1044.

@waylan I am a bit new to the new HTMLParser, so if I'm doing anything sub-optimal, let me know. I know this stuff is fresh in your mind. But the current changes seem to at least not break existing tests while satisfying your new tests.

waylan pushed a commit that referenced this issue Oct 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Bug report. extension Related to one or more of the included extensions.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants