Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Trailing backticks in markdown files #247

Open
oscarbenjamin opened this issue Apr 2, 2024 · 15 comments
Open

Trailing backticks in markdown files #247

oscarbenjamin opened this issue Apr 2, 2024 · 15 comments
Labels

Comments

@oscarbenjamin
Copy link
Contributor

This follows gh-220. After gh-221 it is documented that a markdown block needs to end with a blank line before the backticks like:

```
>>> 1 + 1
2

```

I think that this should be considered a bug so I'm opening this issue.

The source of the bug is the stdlib doctest module (python/cpython#116546). It probably does not make sense to fix it at the stdlib level because the stdlib doctest parser is very rudimentary and does not even know that it is parsing markdown syntax at all. There is also a pytest issue (pytest-dev/pytest#7374) which was closed because pytest's doctest support is unmaintained.

It seems to me that here in pytest-doctestplus is the place to fix this since markdown files for documentation are common.

The erroneous result from the stdlib parser comes in here at the start of parsing:

def parse(self, s, name=None):
result = doctest.DocTestParser.parse(self, s, name=name)

To handle markdown properly the parser would need to be designed for markdown. I suggest having an alternate MarkdownParser class that would look for backticks rather than depending on the stdlib's PS1 regex. The rst directives would ideally also have markdown variants but I am not sure how those could look. Once the fenced sections are extracted from the markdown file they can be parsed by the existing parser to separate input/output.

@oscarbenjamin
Copy link
Contributor Author

This would be a quick and dirty fix:

diff --git a/pytest_doctestplus/plugin.py b/pytest_doctestplus/plugin.py
index c365a44..85188f3 100644
--- a/pytest_doctestplus/plugin.py
+++ b/pytest_doctestplus/plugin.py
@@ -391,6 +391,15 @@ def pytest_configure(config):
         """

         def parse(self, s, name=None):
+            ext = os.path.splitext(name)[1] if name else '.rst'
+
+            # Remove fence lines from code blocks in markdown files. Replacing
+            # these with blank lines means that the parser can find the doctest
+            # fragments. This might seem a bit hacky but so is the regex used
+            # by stdlib DocTestParser.
+            if ext == '.md':
+                s = re.sub(r'^```.*$', '', s, flags=re.MULTILINE)
+
             result = doctest.DocTestParser.parse(self, s, name=name)

             # result is a sequence of alternating text chunks and
@@ -402,7 +411,6 @@ def pytest_configure(config):
             skip_next = False
             skip_all = False

-            ext = os.path.splitext(name)[1] if name else '.rst'
             if ext not in comment_characters:
                 warnings.warn("file format '{}' is not recognized, assuming "
                               "'{}' as the comment character."

The proper solution would be to use a markdown parsing library like marko or something but maybe that's overkill in this case.

@pllim pllim added the bug label Apr 2, 2024
@pllim
Copy link
Contributor

pllim commented Apr 2, 2024

If these few lines are sufficient, I'd say go for it though I am also curious if @lpsinger has any input here. Thanks!

@oscarbenjamin
Copy link
Contributor Author

A slight correction to the diff above. The difference is to allow leading spaces when trimming the backticks.

This diff is enough to make it work for the sympy docs:

diff --git a/pytest_doctestplus/plugin.py b/pytest_doctestplus/plugin.py
index c365a44..65b6321 100644
--- a/pytest_doctestplus/plugin.py
+++ b/pytest_doctestplus/plugin.py
@@ -391,6 +391,15 @@ def pytest_configure(config):
         """

         def parse(self, s, name=None):
+            ext = os.path.splitext(name)[1] if name else '.rst'
+
+            # Remove fence lines from code blocks in markdown files. Replacing
+            # these with blank lines means that the parser can find the doctest
+            # fragments. This might seem a bit hacky but so is the regex used
+            # by stdlib DocTestParser.
+            if ext == '.md':
+                s = re.sub(r'^ *```.*$', '', s, flags=re.MULTILINE)
+
             result = doctest.DocTestParser.parse(self, s, name=name)

             # result is a sequence of alternating text chunks and
@@ -402,7 +411,6 @@ def pytest_configure(config):
             skip_next = False
             skip_all = False

-            ext = os.path.splitext(name)[1] if name else '.rst'
             if ext not in comment_characters:
                 warnings.warn("file format '{}' is not recognized, assuming "
                               "'{}' as the comment character."

What this does is just to remove lines that contain triple backticks. There are problems with that though e.g. valid test output might contain lines with triple backticks (although that is unusual).

The commonmark specification lists a more complete set of rules for defining a code block within a markdown file. There are many cases not handled by the diff above although it would work for the most common cases:
https://spec.commonmark.org/0.30/#indented-code-blocks

@lpsinger
Copy link
Contributor

lpsinger commented Apr 3, 2024

If these few lines are sufficient, I'd say go for it though I am also curious if @lpsinger has any input here. Thanks!

I'm at the American Physical Society meeting this week. I'll take a close look when I get back. It would be more satisfying to fix this upstream in stdlib or pytest, but this approach appears sound.

Does the limitation that the doctest must end in a blank line apply to regular docstrings in Python code, like this?

def foo(x):
    """Frobnicate the bar.

    >>> foo(x)
    'You are standing in an open field west of a white house.'

    """
    ...  # function code here

@oscarbenjamin
Copy link
Contributor Author

Does the limitation that the doctest must end in a blank line apply to regular docstrings in Python code, like this?

No, it doesn't. I'm not sure why...

@oscarbenjamin
Copy link
Contributor Author

Does the limitation that the doctest must end in a blank line apply to regular docstrings in Python code, like this?

No, it doesn't. I'm not sure why...

Oh, I guess it is because the .py files are not actually parsed by doctest to extract the docstrings. It just gets them from __doc__ using runtime introspection.

@lpsinger
Copy link
Contributor

lpsinger commented Apr 9, 2024

@oscarbenjamin, wouldn't your quick fix break reporting of line numbers?

@oscarbenjamin
Copy link
Contributor Author

wouldn't your quick fix break reporting of line numbers?

I don't think so. The newlines are preserved but the rest of the line is removed:

In [1]: s = 'Here is how to do it:\n```\n>>> 1 + 1\n2\n```'

In [2]: print(s)
Here is how to do it:
```
>>> 1 + 1
2
```

In [3]: import re

In [4]: print(re.sub(r'^ *```.*$', '', s, flags=re.MULTILINE))
Here is how to do it:

>>> 1 + 1
2

In [5]: 

Without regex it is basically:

In [8]: replace_fence = lambda l: l if not l.lstrip().startswith('```') else ''

In [9]: print('\n'.join(map(replace_fence, s.splitlines())))
Here is how to do it:

>>> 1 + 1
2

Maybe it should also strip tabs rather than only spaces. Using .lstrip removes all whitespace. The regex above only removes spaces but could be changed to use \s for whitespace:

re.sub(r'^\s*```.*$', '', s, flags=re.MULTILINE)

@lpsinger
Copy link
Contributor

lpsinger commented Apr 9, 2024

I think that using a Markdown parser would be more robust. Personally, I wouldn't mind the dependency, especially if it was activated by an "extra" like pip install pytest-doctestplus[md]. Would you be interested in creating a PR for that?

@bsipocz
Copy link
Member

bsipocz commented Apr 9, 2024

👍 for having an md extra

@oscarbenjamin
Copy link
Contributor Author

Would you be interested in creating a PR for that?

I can have a go.

To be clear though I have no prior experience with parsing markdown...

Does anyone have any particular opinion of which markdown parsers are suitable?

Some examples:

https://github.com/miyuchina/mistletoe
https://github.com/frostming/marko
https://github.com/Python-Markdown/markdown
https://github.com/lepture/mistune

It doesn't look like any of these has any dependencies.

It is not clear if python-markdown provides a usable interface for what is needed here or if it just renders markdown into html. I think marko can do what is needed but it seems a bit overly complicated. I think mistletoe can do what is needed but it is not that well documented.

Here the author of python-markdown recommends mistune for a somewhat similar task:
https://stackoverflow.com/questions/27349951/parse-and-traverse-elements-from-a-markdown-file

This is probably what is needed if using mistune:
https://mistune.lepture.com/en/latest/guide.html#abstract-syntax-tree

There are also a number of other projects that seem to be more directly targeted at extracting code blocks from markdown but do not seem to have as many users.

@oscarbenjamin
Copy link
Contributor Author

This is what it looks like with mistune:

In [1]: import mistune

In [2]: markdown = mistune.create_markdown(renderer=None)

In [3]: s = 'Here is how to do it:\n```\n>>> 1 + 1\n2\n```'

In [4]: markdown(s)
Out[4]: 
[{'type': 'paragraph',
  'children': [{'type': 'text', 'raw': 'Here is how to do it:'}]},
 {'type': 'block_code',
  'raw': '>>> 1 + 1\n2\n',
  'style': 'fenced',
  'marker': '```'}]

I guess you just go through looking for block_code and extracting the raw part.

@bsipocz
Copy link
Member

bsipocz commented Apr 9, 2024

If we could stay in the scientific python adjacent ecosystem, it would be great. E.g. the jupyter notebook infrastructure (we do have most of our tutorials in myst-nb format) seems to build on https://github.com/executablebooks/markdown-it-py, so would look into whether that ticks all the boxes for here, too.

@oscarbenjamin
Copy link
Contributor Author

It looks like markdown-it works:

In [1]: from markdown_it import MarkdownIt
   ...: 
   ...: md = MarkdownIt()

In [2]: s = 'Here is how to do it:\n```\n>>> 1 + 1\n2\n```'

In [3]: for token in md.parse(s):
   ...:     print(token)
   ...: 
Token(type='paragraph_open', tag='p', nesting=1, attrs={}, map=[0, 1], level=0, children=None, content='', markup='', info='', meta={}, block=True, hidden=False)
Token(type='inline', tag='', nesting=0, attrs={}, map=[0, 1], level=1, children=[Token(type='text', tag='', nesting=0, attrs={}, map=None, level=0, children=None, content='Here is how to do it:', markup='', info='', meta={}, block=False, hidden=False)], content='Here is how to do it:', markup='', info='', meta={}, block=True, hidden=False)
Token(type='paragraph_close', tag='p', nesting=-1, attrs={}, map=None, level=0, children=None, content='', markup='', info='', meta={}, block=True, hidden=False)
Token(type='fence', tag='code', nesting=0, attrs={}, map=[1, 5], level=0, children=None, content='>>> 1 + 1\n2\n', markup='```', info='', meta={}, block=True, hidden=False)

In [4]: for token in md.parse(s):
   ...:     if token.type == 'fence':
   ...:         print(token)
   ...: 
Token(type='fence', tag='code', nesting=0, attrs={}, map=[1, 5], level=0, children=None, content='>>> 1 + 1\n2\n', markup='```', info='', meta={}, block=True, hidden=False)

In [5]: for token in md.parse(s):
   ...:     if token.type == 'fence':
   ...:         print(token.content)
   ...: 
>>> 1 + 1
2

@oscarbenjamin
Copy link
Contributor Author

Or maybe like this:

In [20]: from markdown_it.tree import SyntaxTreeNode
    ...: 
    ...: md = MarkdownIt("commonmark")

In [21]: s = 'Here is how to do it:\n```\n>>> 1 + 1\n2\n```'

In [22]: tokens = md.parse(s)

In [23]: node = SyntaxTreeNode(tokens)

In [24]: node.children
Out[24]: [SyntaxTreeNode(paragraph), SyntaxTreeNode(fence)]

In [25]: node.children[1].content
Out[25]: '>>> 1 + 1\n2\n'

I think markdown-it works for this but would have to try on a larger file to be sure.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants