Trailing backticks in markdown files #247

oscarbenjamin · 2024-04-02T17:40:50Z

This follows gh-220. After gh-221 it is documented that a markdown block needs to end with a blank line before the backticks like:

```
>>> 1 + 1
2

```

I think that this should be considered a bug so I'm opening this issue.

The source of the bug is the stdlib doctest module (python/cpython#116546). It probably does not make sense to fix it at the stdlib level because the stdlib doctest parser is very rudimentary and does not even know that it is parsing markdown syntax at all. There is also a pytest issue (pytest-dev/pytest#7374) which was closed because pytest's doctest support is unmaintained.

It seems to me that here in pytest-doctestplus is the place to fix this since markdown files for documentation are common.

The erroneous result from the stdlib parser comes in here at the start of parsing:

pytest-doctestplus/pytest_doctestplus/plugin.py

Lines 393 to 394 in 84fd2e2

    
           def parse(self, s, name=None): 
        
               result = doctest.DocTestParser.parse(self, s, name=name)

To handle markdown properly the parser would need to be designed for markdown. I suggest having an alternate MarkdownParser class that would look for backticks rather than depending on the stdlib's PS1 regex. The rst directives would ideally also have markdown variants but I am not sure how those could look. Once the fenced sections are extracted from the markdown file they can be parsed by the existing parser to separate input/output.

The text was updated successfully, but these errors were encountered:

oscarbenjamin · 2024-04-02T18:43:01Z

This would be a quick and dirty fix:

diff --git a/pytest_doctestplus/plugin.py b/pytest_doctestplus/plugin.py
index c365a44..85188f3 100644
--- a/pytest_doctestplus/plugin.py
+++ b/pytest_doctestplus/plugin.py
@@ -391,6 +391,15 @@ def pytest_configure(config):
         """

         def parse(self, s, name=None):
+            ext = os.path.splitext(name)[1] if name else '.rst'
+
+            # Remove fence lines from code blocks in markdown files. Replacing
+            # these with blank lines means that the parser can find the doctest
+            # fragments. This might seem a bit hacky but so is the regex used
+            # by stdlib DocTestParser.
+            if ext == '.md':
+                s = re.sub(r'^```.*$', '', s, flags=re.MULTILINE)
+
             result = doctest.DocTestParser.parse(self, s, name=name)

             # result is a sequence of alternating text chunks and
@@ -402,7 +411,6 @@ def pytest_configure(config):
             skip_next = False
             skip_all = False

-            ext = os.path.splitext(name)[1] if name else '.rst'
             if ext not in comment_characters:
                 warnings.warn("file format '{}' is not recognized, assuming "
                               "'{}' as the comment character."

The proper solution would be to use a markdown parsing library like marko or something but maybe that's overkill in this case.

pllim · 2024-04-02T18:50:03Z

If these few lines are sufficient, I'd say go for it though I am also curious if @lpsinger has any input here. Thanks!

oscarbenjamin · 2024-04-02T22:03:56Z

A slight correction to the diff above. The difference is to allow leading spaces when trimming the backticks.

This diff is enough to make it work for the sympy docs:

diff --git a/pytest_doctestplus/plugin.py b/pytest_doctestplus/plugin.py
index c365a44..65b6321 100644
--- a/pytest_doctestplus/plugin.py
+++ b/pytest_doctestplus/plugin.py
@@ -391,6 +391,15 @@ def pytest_configure(config):
         """

         def parse(self, s, name=None):
+            ext = os.path.splitext(name)[1] if name else '.rst'
+
+            # Remove fence lines from code blocks in markdown files. Replacing
+            # these with blank lines means that the parser can find the doctest
+            # fragments. This might seem a bit hacky but so is the regex used
+            # by stdlib DocTestParser.
+            if ext == '.md':
+                s = re.sub(r'^ *```.*$', '', s, flags=re.MULTILINE)
+
             result = doctest.DocTestParser.parse(self, s, name=name)

             # result is a sequence of alternating text chunks and
@@ -402,7 +411,6 @@ def pytest_configure(config):
             skip_next = False
             skip_all = False

-            ext = os.path.splitext(name)[1] if name else '.rst'
             if ext not in comment_characters:
                 warnings.warn("file format '{}' is not recognized, assuming "
                               "'{}' as the comment character."

What this does is just to remove lines that contain triple backticks. There are problems with that though e.g. valid test output might contain lines with triple backticks (although that is unusual).

The commonmark specification lists a more complete set of rules for defining a code block within a markdown file. There are many cases not handled by the diff above although it would work for the most common cases:
https://spec.commonmark.org/0.30/#indented-code-blocks

lpsinger · 2024-04-03T14:22:16Z

If these few lines are sufficient, I'd say go for it though I am also curious if @lpsinger has any input here. Thanks!

I'm at the American Physical Society meeting this week. I'll take a close look when I get back. It would be more satisfying to fix this upstream in stdlib or pytest, but this approach appears sound.

Does the limitation that the doctest must end in a blank line apply to regular docstrings in Python code, like this?

def foo(x):
    """Frobnicate the bar.

    >>> foo(x)
    'You are standing in an open field west of a white house.'

    """
    ...  # function code here

oscarbenjamin · 2024-04-03T23:59:25Z

Does the limitation that the doctest must end in a blank line apply to regular docstrings in Python code, like this?

No, it doesn't. I'm not sure why...

oscarbenjamin · 2024-04-05T22:07:19Z

Does the limitation that the doctest must end in a blank line apply to regular docstrings in Python code, like this?

No, it doesn't. I'm not sure why...

Oh, I guess it is because the .py files are not actually parsed by doctest to extract the docstrings. It just gets them from __doc__ using runtime introspection.

lpsinger · 2024-04-09T14:09:28Z

@oscarbenjamin, wouldn't your quick fix break reporting of line numbers?

oscarbenjamin · 2024-04-09T16:28:01Z

wouldn't your quick fix break reporting of line numbers?

I don't think so. The newlines are preserved but the rest of the line is removed:

In [1]: s = 'Here is how to do it:\n```\n>>> 1 + 1\n2\n```'

In [2]: print(s)
Here is how to do it:
```
>>> 1 + 1
2
```

In [3]: import re

In [4]: print(re.sub(r'^ *```.*$', '', s, flags=re.MULTILINE))
Here is how to do it:

>>> 1 + 1
2

In [5]:

Without regex it is basically:

In [8]: replace_fence = lambda l: l if not l.lstrip().startswith('```') else ''

In [9]: print('\n'.join(map(replace_fence, s.splitlines())))
Here is how to do it:

>>> 1 + 1
2

Maybe it should also strip tabs rather than only spaces. Using .lstrip removes all whitespace. The regex above only removes spaces but could be changed to use \s for whitespace:

re.sub(r'^\s*```.*$', '', s, flags=re.MULTILINE)

lpsinger · 2024-04-09T18:51:40Z

I think that using a Markdown parser would be more robust. Personally, I wouldn't mind the dependency, especially if it was activated by an "extra" like pip install pytest-doctestplus[md]. Would you be interested in creating a PR for that?

bsipocz · 2024-04-09T20:36:43Z

👍 for having an md extra

oscarbenjamin · 2024-04-09T21:26:16Z

Would you be interested in creating a PR for that?

I can have a go.

To be clear though I have no prior experience with parsing markdown...

Does anyone have any particular opinion of which markdown parsers are suitable?

Some examples:

https://github.com/miyuchina/mistletoe
https://github.com/frostming/marko
https://github.com/Python-Markdown/markdown
https://github.com/lepture/mistune

It doesn't look like any of these has any dependencies.

It is not clear if python-markdown provides a usable interface for what is needed here or if it just renders markdown into html. I think marko can do what is needed but it seems a bit overly complicated. I think mistletoe can do what is needed but it is not that well documented.

Here the author of python-markdown recommends mistune for a somewhat similar task:
https://stackoverflow.com/questions/27349951/parse-and-traverse-elements-from-a-markdown-file

This is probably what is needed if using mistune:
https://mistune.lepture.com/en/latest/guide.html#abstract-syntax-tree

There are also a number of other projects that seem to be more directly targeted at extracting code blocks from markdown but do not seem to have as many users.

oscarbenjamin · 2024-04-09T21:28:44Z

This is what it looks like with mistune:

In [1]: import mistune

In [2]: markdown = mistune.create_markdown(renderer=None)

In [3]: s = 'Here is how to do it:\n```\n>>> 1 + 1\n2\n```'

In [4]: markdown(s)
Out[4]: 
[{'type': 'paragraph',
  'children': [{'type': 'text', 'raw': 'Here is how to do it:'}]},
 {'type': 'block_code',
  'raw': '>>> 1 + 1\n2\n',
  'style': 'fenced',
  'marker': '```'}]

I guess you just go through looking for block_code and extracting the raw part.

bsipocz · 2024-04-09T21:32:36Z

If we could stay in the scientific python adjacent ecosystem, it would be great. E.g. the jupyter notebook infrastructure (we do have most of our tutorials in myst-nb format) seems to build on https://github.com/executablebooks/markdown-it-py, so would look into whether that ticks all the boxes for here, too.

oscarbenjamin · 2024-04-09T21:42:37Z

It looks like markdown-it works:

In [1]: from markdown_it import MarkdownIt
   ...: 
   ...: md = MarkdownIt()

In [2]: s = 'Here is how to do it:\n```\n>>> 1 + 1\n2\n```'

In [3]: for token in md.parse(s):
   ...:     print(token)
   ...: 
Token(type='paragraph_open', tag='p', nesting=1, attrs={}, map=[0, 1], level=0, children=None, content='', markup='', info='', meta={}, block=True, hidden=False)
Token(type='inline', tag='', nesting=0, attrs={}, map=[0, 1], level=1, children=[Token(type='text', tag='', nesting=0, attrs={}, map=None, level=0, children=None, content='Here is how to do it:', markup='', info='', meta={}, block=False, hidden=False)], content='Here is how to do it:', markup='', info='', meta={}, block=True, hidden=False)
Token(type='paragraph_close', tag='p', nesting=-1, attrs={}, map=None, level=0, children=None, content='', markup='', info='', meta={}, block=True, hidden=False)
Token(type='fence', tag='code', nesting=0, attrs={}, map=[1, 5], level=0, children=None, content='>>> 1 + 1\n2\n', markup='```', info='', meta={}, block=True, hidden=False)

In [4]: for token in md.parse(s):
   ...:     if token.type == 'fence':
   ...:         print(token)
   ...: 
Token(type='fence', tag='code', nesting=0, attrs={}, map=[1, 5], level=0, children=None, content='>>> 1 + 1\n2\n', markup='```', info='', meta={}, block=True, hidden=False)

In [5]: for token in md.parse(s):
   ...:     if token.type == 'fence':
   ...:         print(token.content)
   ...: 
>>> 1 + 1
2

oscarbenjamin · 2024-04-09T21:48:33Z

Or maybe like this:

In [20]: from markdown_it.tree import SyntaxTreeNode
    ...: 
    ...: md = MarkdownIt("commonmark")

In [21]: s = 'Here is how to do it:\n```\n>>> 1 + 1\n2\n```'

In [22]: tokens = md.parse(s)

In [23]: node = SyntaxTreeNode(tokens)

In [24]: node.children
Out[24]: [SyntaxTreeNode(paragraph), SyntaxTreeNode(fence)]

In [25]: node.children[1].content
Out[25]: '>>> 1 + 1\n2\n'

I think markdown-it works for this but would have to try on a larger file to be sure.

pllim added the bug label Apr 2, 2024

oscarbenjamin mentioned this issue Apr 2, 2024

doctest cannot be used to test markdown files. python/cpython#116546

Open

oscarbenjamin mentioned this issue Apr 9, 2024

docs: adjust doctests for pytest-doctestplus sympy/sympy#26451

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Trailing backticks in markdown files #247

Trailing backticks in markdown files #247

oscarbenjamin commented Apr 2, 2024

oscarbenjamin commented Apr 2, 2024

pllim commented Apr 2, 2024

oscarbenjamin commented Apr 2, 2024

lpsinger commented Apr 3, 2024

oscarbenjamin commented Apr 3, 2024

oscarbenjamin commented Apr 5, 2024

lpsinger commented Apr 9, 2024

oscarbenjamin commented Apr 9, 2024

lpsinger commented Apr 9, 2024

bsipocz commented Apr 9, 2024

oscarbenjamin commented Apr 9, 2024

oscarbenjamin commented Apr 9, 2024

bsipocz commented Apr 9, 2024

oscarbenjamin commented Apr 9, 2024

oscarbenjamin commented Apr 9, 2024

Trailing backticks in markdown files #247

Trailing backticks in markdown files #247

Comments

oscarbenjamin commented Apr 2, 2024

oscarbenjamin commented Apr 2, 2024

pllim commented Apr 2, 2024

oscarbenjamin commented Apr 2, 2024

lpsinger commented Apr 3, 2024

oscarbenjamin commented Apr 3, 2024

oscarbenjamin commented Apr 5, 2024

lpsinger commented Apr 9, 2024

oscarbenjamin commented Apr 9, 2024

lpsinger commented Apr 9, 2024

bsipocz commented Apr 9, 2024

oscarbenjamin commented Apr 9, 2024

oscarbenjamin commented Apr 9, 2024

bsipocz commented Apr 9, 2024

oscarbenjamin commented Apr 9, 2024

oscarbenjamin commented Apr 9, 2024