-
Notifications
You must be signed in to change notification settings - Fork 38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Trailing backticks in markdown files #247
Comments
This would be a quick and dirty fix: diff --git a/pytest_doctestplus/plugin.py b/pytest_doctestplus/plugin.py
index c365a44..85188f3 100644
--- a/pytest_doctestplus/plugin.py
+++ b/pytest_doctestplus/plugin.py
@@ -391,6 +391,15 @@ def pytest_configure(config):
"""
def parse(self, s, name=None):
+ ext = os.path.splitext(name)[1] if name else '.rst'
+
+ # Remove fence lines from code blocks in markdown files. Replacing
+ # these with blank lines means that the parser can find the doctest
+ # fragments. This might seem a bit hacky but so is the regex used
+ # by stdlib DocTestParser.
+ if ext == '.md':
+ s = re.sub(r'^```.*$', '', s, flags=re.MULTILINE)
+
result = doctest.DocTestParser.parse(self, s, name=name)
# result is a sequence of alternating text chunks and
@@ -402,7 +411,6 @@ def pytest_configure(config):
skip_next = False
skip_all = False
- ext = os.path.splitext(name)[1] if name else '.rst'
if ext not in comment_characters:
warnings.warn("file format '{}' is not recognized, assuming "
"'{}' as the comment character." The proper solution would be to use a markdown parsing library like marko or something but maybe that's overkill in this case. |
If these few lines are sufficient, I'd say go for it though I am also curious if @lpsinger has any input here. Thanks! |
A slight correction to the diff above. The difference is to allow leading spaces when trimming the backticks. This diff is enough to make it work for the sympy docs: diff --git a/pytest_doctestplus/plugin.py b/pytest_doctestplus/plugin.py
index c365a44..65b6321 100644
--- a/pytest_doctestplus/plugin.py
+++ b/pytest_doctestplus/plugin.py
@@ -391,6 +391,15 @@ def pytest_configure(config):
"""
def parse(self, s, name=None):
+ ext = os.path.splitext(name)[1] if name else '.rst'
+
+ # Remove fence lines from code blocks in markdown files. Replacing
+ # these with blank lines means that the parser can find the doctest
+ # fragments. This might seem a bit hacky but so is the regex used
+ # by stdlib DocTestParser.
+ if ext == '.md':
+ s = re.sub(r'^ *```.*$', '', s, flags=re.MULTILINE)
+
result = doctest.DocTestParser.parse(self, s, name=name)
# result is a sequence of alternating text chunks and
@@ -402,7 +411,6 @@ def pytest_configure(config):
skip_next = False
skip_all = False
- ext = os.path.splitext(name)[1] if name else '.rst'
if ext not in comment_characters:
warnings.warn("file format '{}' is not recognized, assuming "
"'{}' as the comment character." What this does is just to remove lines that contain triple backticks. There are problems with that though e.g. valid test output might contain lines with triple backticks (although that is unusual). The commonmark specification lists a more complete set of rules for defining a code block within a markdown file. There are many cases not handled by the diff above although it would work for the most common cases: |
I'm at the American Physical Society meeting this week. I'll take a close look when I get back. It would be more satisfying to fix this upstream in stdlib or pytest, but this approach appears sound. Does the limitation that the doctest must end in a blank line apply to regular docstrings in Python code, like this? def foo(x):
"""Frobnicate the bar.
>>> foo(x)
'You are standing in an open field west of a white house.'
"""
... # function code here |
No, it doesn't. I'm not sure why... |
Oh, I guess it is because the .py files are not actually parsed by doctest to extract the docstrings. It just gets them from |
@oscarbenjamin, wouldn't your quick fix break reporting of line numbers? |
I don't think so. The newlines are preserved but the rest of the line is removed: In [1]: s = 'Here is how to do it:\n```\n>>> 1 + 1\n2\n```'
In [2]: print(s)
Here is how to do it:
```
>>> 1 + 1
2
```
In [3]: import re
In [4]: print(re.sub(r'^ *```.*$', '', s, flags=re.MULTILINE))
Here is how to do it:
>>> 1 + 1
2
In [5]: Without regex it is basically: In [8]: replace_fence = lambda l: l if not l.lstrip().startswith('```') else ''
In [9]: print('\n'.join(map(replace_fence, s.splitlines())))
Here is how to do it:
>>> 1 + 1
2 Maybe it should also strip tabs rather than only spaces. Using re.sub(r'^\s*```.*$', '', s, flags=re.MULTILINE) |
I think that using a Markdown parser would be more robust. Personally, I wouldn't mind the dependency, especially if it was activated by an "extra" like |
👍 for having an |
I can have a go. To be clear though I have no prior experience with parsing markdown... Does anyone have any particular opinion of which markdown parsers are suitable? Some examples: https://github.com/miyuchina/mistletoe It doesn't look like any of these has any dependencies. It is not clear if python-markdown provides a usable interface for what is needed here or if it just renders markdown into html. I think marko can do what is needed but it seems a bit overly complicated. I think mistletoe can do what is needed but it is not that well documented. Here the author of python-markdown recommends mistune for a somewhat similar task: This is probably what is needed if using mistune: There are also a number of other projects that seem to be more directly targeted at extracting code blocks from markdown but do not seem to have as many users. |
This is what it looks like with mistune: In [1]: import mistune
In [2]: markdown = mistune.create_markdown(renderer=None)
In [3]: s = 'Here is how to do it:\n```\n>>> 1 + 1\n2\n```'
In [4]: markdown(s)
Out[4]:
[{'type': 'paragraph',
'children': [{'type': 'text', 'raw': 'Here is how to do it:'}]},
{'type': 'block_code',
'raw': '>>> 1 + 1\n2\n',
'style': 'fenced',
'marker': '```'}] I guess you just go through looking for |
If we could stay in the scientific python adjacent ecosystem, it would be great. E.g. the jupyter notebook infrastructure (we do have most of our tutorials in myst-nb format) seems to build on https://github.com/executablebooks/markdown-it-py, so would look into whether that ticks all the boxes for here, too. |
It looks like markdown-it works: In [1]: from markdown_it import MarkdownIt
...:
...: md = MarkdownIt()
In [2]: s = 'Here is how to do it:\n```\n>>> 1 + 1\n2\n```'
In [3]: for token in md.parse(s):
...: print(token)
...:
Token(type='paragraph_open', tag='p', nesting=1, attrs={}, map=[0, 1], level=0, children=None, content='', markup='', info='', meta={}, block=True, hidden=False)
Token(type='inline', tag='', nesting=0, attrs={}, map=[0, 1], level=1, children=[Token(type='text', tag='', nesting=0, attrs={}, map=None, level=0, children=None, content='Here is how to do it:', markup='', info='', meta={}, block=False, hidden=False)], content='Here is how to do it:', markup='', info='', meta={}, block=True, hidden=False)
Token(type='paragraph_close', tag='p', nesting=-1, attrs={}, map=None, level=0, children=None, content='', markup='', info='', meta={}, block=True, hidden=False)
Token(type='fence', tag='code', nesting=0, attrs={}, map=[1, 5], level=0, children=None, content='>>> 1 + 1\n2\n', markup='```', info='', meta={}, block=True, hidden=False)
In [4]: for token in md.parse(s):
...: if token.type == 'fence':
...: print(token)
...:
Token(type='fence', tag='code', nesting=0, attrs={}, map=[1, 5], level=0, children=None, content='>>> 1 + 1\n2\n', markup='```', info='', meta={}, block=True, hidden=False)
In [5]: for token in md.parse(s):
...: if token.type == 'fence':
...: print(token.content)
...:
>>> 1 + 1
2 |
Or maybe like this: In [20]: from markdown_it.tree import SyntaxTreeNode
...:
...: md = MarkdownIt("commonmark")
In [21]: s = 'Here is how to do it:\n```\n>>> 1 + 1\n2\n```'
In [22]: tokens = md.parse(s)
In [23]: node = SyntaxTreeNode(tokens)
In [24]: node.children
Out[24]: [SyntaxTreeNode(paragraph), SyntaxTreeNode(fence)]
In [25]: node.children[1].content
Out[25]: '>>> 1 + 1\n2\n' I think markdown-it works for this but would have to try on a larger file to be sure. |
This follows gh-220. After gh-221 it is documented that a markdown block needs to end with a blank line before the backticks like:
I think that this should be considered a bug so I'm opening this issue.
The source of the bug is the stdlib doctest module (python/cpython#116546). It probably does not make sense to fix it at the stdlib level because the stdlib doctest parser is very rudimentary and does not even know that it is parsing markdown syntax at all. There is also a pytest issue (pytest-dev/pytest#7374) which was closed because pytest's doctest support is unmaintained.
It seems to me that here in pytest-doctestplus is the place to fix this since markdown files for documentation are common.
The erroneous result from the stdlib parser comes in here at the start of parsing:
pytest-doctestplus/pytest_doctestplus/plugin.py
Lines 393 to 394 in 84fd2e2
To handle markdown properly the parser would need to be designed for markdown. I suggest having an alternate
MarkdownParser
class that would look for backticks rather than depending on the stdlib's PS1 regex. The rst directives would ideally also have markdown variants but I am not sure how those could look. Once the fenced sections are extracted from the markdown file they can be parsed by the existing parser to separate input/output.The text was updated successfully, but these errors were encountered: