-
Notifications
You must be signed in to change notification settings - Fork 864
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
markdown.extensions.md_in_html fails with TypeError: expected string or bytes-like object #1040
Comments
I imagine this has to do with the new HTML handler. It looks like there may be a corner case where we think we are handing over strings, but etree elements are slipping through. |
Hmm, looks like somehow we are stashing etree objects instead of strings.... |
It appears it gets handled by diff --git a/markdown/extensions/md_in_html.py b/markdown/extensions/md_in_html.py
index 3518d05..e1fcc7a 100644
--- a/markdown/extensions/md_in_html.py
+++ b/markdown/extensions/md_in_html.py
@@ -127,6 +127,7 @@ class HTMLExtractorExtra(HTMLExtractor):
if not self.mdstack:
# Last item in stack is closed. Stash it
element = self.get_element()
+ self.cleandoc.append('\n\n')
self.cleandoc.append(self.md.htmlStash.store(element))
self.cleandoc.append('\n\n')
self.state = []
Now this causes a different behavior than when
Is this discrepancy a problem? That is probably a question for @waylan. I guess if we wanted to keep the handling the same as when I'll let @waylan weigh in. I think I understand the problem well enough, but the direction needs to be outlined before I could move forward on this. By the way @waylan, this was much easier to debug than the old monstrosity. So great work on the refactor 😄. |
Just hacking around a bit. When we see a raw HTML block, we could inspect whether there are newlines before it. Then we could force a similar behavior as what we have by default. This would prevent inline block HTML from being handled as block as it is specified incorrectly. diff --git a/markdown/extensions/md_in_html.py b/markdown/extensions/md_in_html.py
index 3518d05..f865f78 100644
--- a/markdown/extensions/md_in_html.py
+++ b/markdown/extensions/md_in_html.py
@@ -127,6 +127,22 @@ class HTMLExtractorExtra(HTMLExtractor):
if not self.mdstack:
# Last item in stack is closed. Stash it
element = self.get_element()
+ # If cleandoc is empty, assume first element
+ item = self.cleandoc[-1] if self.cleandoc else '\n'
+ # Got an element, assume there are no newlines prior to this.
+ if not isinstance(item, str):
+ item = ''
+ # Ensure we have at least two newlines if we find one
+ # If not, we must have an element that is block, but inline,
+ # and if so, we should convert it to text as it won't get
+ # processed as a block
+ if not item.endswith('\n\n'):
+ if item.endswith('\n'):
+ self.cleandoc.append('\n')
+ else:
+ if "markdown" in element.attrib:
+ del element.attrib["markdown"]
+ element = etree.tostring(element, encoding='utf-8', method='html').decode('utf-8')
self.cleandoc.append(self.md.htmlStash.store(element))
self.cleandoc.append('\n\n')
self.state = []
Alternatively, we could always force blocks to be treated as blocks even if they are inline by forcing new lines before them, but then I think we'd need to do that outside of Anyways, hopefully, this provides enough information for us to make a decision on how to handle this. I'm happy to push something through pull request once a decision of how we want to handle this is made. |
Personally, I find forcing the newlines before the block to produce more sane results, but this could be considered undefined behavior if Markdown rules are not being followed 🤷 . |
I haven't read through this entire report and @facelessuser's analysis yet. However, I will address the two failing cases in the initial report.
Presumably the logic in the core needs to be replicated in the extension. Regardless, in either case, a crashing bug is bad. We should probably include something in the postprocessor to handle any |
Sounds good. |
Thank you. 😊 |
Yes, this is how its being done in the core. Every time we determine that a new raw HTML block is started, we insert one blank line before the placeholder (here). We don't even check if a blank line preceded the start tag (we only check that this is a new block and that the start tag is at the start of a line). In some cases we end up with extra blank lines, which is fine because Markdown ignores them. It appears that the blank line is not being inserted consistently in the extension. We do have to be careful of one thing: that the blank lines are only inserted in Markdown content. Never within raw HTML tags which don't get parsed as Markdown. Doing that would alter the raw HTML in ways we promise not to. |
By calling str on all stash elements we ensure they don't raise an error. Worse case, soemthing like `<Element 'div' at 0x000001B2DAE94900>` gets inserted into the output. However, with the override in the md_in_html extension, we actually serialize and reinsert the original HTML. Worse case, an HTML block which should be parsed as Markdown gets skipped by the extension (`<div markdown="block"></div>` gets inserting into the output). The tricky part is testing as there should be no known cases where this ever occurs. Therefore, we forefully pass an etree Element directly to the method in the test. That said, as Python-Markdown#1040 is unresolved at this point, I have tested locally with a real existing case and it works well. Related to Python-Markdown#1040.
By calling str on all stash elements we ensure they don't raise an error. Worse case, soemthing like `<Element 'div' at 0x000001B2DAE94900>` gets inserted into the output. However, with the override in the md_in_html extension, we actually serialize and reinsert the original HTML. Worse case, an HTML block which should be parsed as Markdown gets skipped by the extension (`<div markdown="block"></div>` gets inserting into the output). The tricky part is testing as there should be no known cases where this ever occurs. Therefore, we forefully pass an etree Element directly to the method in the test. That said, as #1040 is unresolved at this point, I have tested locally with a real existing case and it works well. Related to #1040.
That has been addressed in #1042, which means no more crashing errors. We still need to address the incorrect parsing in the extension. |
I pushed some tests to #1043 and a fix for the simplest one. I ran out of time on the other two failing tests. At least the desired behavior is defined. Contributions are welcome. |
Cool. I should be able to take a crack at these. |
I've resolved all the tests. |
I'm running into an issue with "simple" Markdown + HTML examples crashing with a
TypeError
. For example, the following throws aTypeError: expected string or bytes-like object
exception:Versions:
Markdown==3.3.1
, CPython 3.8.6I haven't had time yet to investigate this further, but I wanted to get an issue opened with some failing test cases in case others are running into this as well. I suspect this has something to do with the raw HTML parser rewrite in #803.
Steps to reproduce:
Save the following as
mderror.py
:Result:
The text was updated successfully, but these errors were encountered: