More sourcepos? #26

Closed
elibarzilay opened this issue Apr 7, 2015 · 10 comments

Comments

@elibarzilay
Contributor

It would be nice if there were more source position information in the output, to the point where it is possible to track the source of every bit of text.

It seems to me that currently, this tool is overall very limited (e.g., it doesn't come with a huge number of extensions and command-line arguments to switch them on and off). This looks like a good idea for something that is supposed to serve as a basis for some system that will extend it. In my case, I played with the idea of embedding it with a different markup system, which would result in having the best of both (the convenience of cmark, and the flexibility of a markup when needed). In any case, one way to get something like that going without implementing yet another Markdown (or CommonMark) parser is to use an existing tool like cmark. Unfortunately, this requires knowing exactly where each bit of text came from, not just the current per-line thing (?). This is because my target system is a proper language, so source tracking is important for syntax errors etc.

Ideally, this could be done for --smart replacement text too...

@jgm
Member

jgm commented Apr 7, 2015

Currently we have source position (start and end line and column) for block-level elements, but not for inline-level elements.

Storing source position for inline elements would require making a few parts of the parser more complex, and it might have an efficiency cost, but it could certainly be done.

Indeed, it is already done (if I'm not mistaken) in Knagis/CommonMark.NET and perhaps in other implementations too.

@MathieuDuponchelle
Contributor

MathieuDuponchelle commented Nov 29, 2016

Hey @jgm, as I said in #131 I've started experimenting with offsets, see https://github.com/MathieuDuponchelle/cmark/commits/more-sourcepos

This branch is incomplete, but I think it's a good first step in the right direction, pretty simple and already showing some results, given this input file:

a\
b

* a
  * b

# c ## 




> yo
papi

‘a
  b

The code in that branch doesn't get thrown off and reports correct extents for the inlines near the end.

The current approach is to report absolute byte offsets from the start of the file; Unicode handling is left to the client. This simple Python 3 script helps validate the ranges:

import argparse

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('input', type=str)
    parser.add_argument('start', type=int)
    parser.add_argument('end', type=int)

    args = parser.parse_args()

    # Read raw bytes: the reported offsets are byte offsets, not code points.
    with open(args.input, 'rb') as f:
        contents = f.read()

    # Slicing never raises IndexError, so check the bounds explicitly.
    if args.end > len(contents):
        print("File too short")
    else:
        try:
            print('[' + contents[args.start:args.end].decode('utf8') + ']')
        except UnicodeDecodeError as e:
            print("Invalid offsets, the range should start and end on code point boundaries, error was:")
            print(e)
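
Because the offsets are bytes, a client that works with decoded strings has to convert them itself. A minimal sketch of that conversion, assuming UTF-8 input; `byte_to_char_offset` and the sample extents are made up for illustration and are not part of the branch:

```python
def byte_to_char_offset(data: bytes, byte_offset: int) -> int:
    """Convert a byte offset into `data` to a code point offset.

    Raises UnicodeDecodeError if the offset falls inside a multi-byte sequence.
    """
    return len(data[:byte_offset].decode("utf-8"))


source = "é *b*\n".encode("utf-8")  # 'é' takes two bytes in UTF-8
# Hypothetical byte extents of the text inside the emphasis
start, end = 4, 5
print(source[start:end].decode("utf-8"))
print(byte_to_char_offset(source, start), byte_to_char_offset(source, end))
```

An offset that lands in the middle of a multi-byte character raises `UnicodeDecodeError`, which is the same signal the validation script relies on.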

I didn't add any API yet, but modified the xml output to report these offsets, and disabled the only test that checks this renderer, in api_test.

The only intrusive commit with respect to actual parsing is MathieuDuponchelle@c6b706a : we want to parse inlines from unstripped lines, to have accurate positioning. This means I had to add space skipping in a few strategic places to fix regressions.

I did say the patch was incomplete, we will want to handle emphasis insertions, links and a few other things, but I'm now pretty confident we can get accurate source positions for all elements in the AST in a simple and minimally intrusive manner.

Anyway, I won't start work on "passthrough" commonmark rendering before we have this, @jgm waiting for your initial comments, a priori the complexity of this is way lower than that of an extension API so I'm hopeful I can actually get something in this time :P

@MathieuDuponchelle
Contributor

Pushed a few more commits, handling emphasis and inline links, still need to find:

  • The best approach for reference links, @jgm any reason why these can't be part of the AST?
  • What to associate link labels and link titles with, currently given:
[a](b)

we have:

  <paragraph start-extents="0:0" extents="0:0" end-extents="7:7">
    <link start-extents="0:1" extents="0:0" end-extents="2:7" destination="b" title="">
      <text start-extents="0:0" extents="1:2" end-extents="0:0">a</text>
    </link>
  </paragraph>

That works, but could be more helpful

@jgm
Member

jgm commented Dec 1, 2016 via email

@MathieuDuponchelle
Contributor

MathieuDuponchelle commented Dec 1, 2016

Disclaimer: that is experimental, based on your suggestion of having multiple pairs of offsets.

I'm coming at this from the "passthrough" perspective, which is my end goal. To illustrate this, let's take a simple block as an example:

#  a  ##

With the current code, we only have the information that the heading block extends from the first column of the first line to the fourth column of the first line (1:1-1:4)

We don't have any information about the text inline it contains.

With my current approach, here's what we now know:

  <heading sourcepos="1:1-1:4" start-extents="0:3" extents="0:0" end-extents="4:9" level="1">
    <text start-extents="0:0" extents="3:4" end-extents="0:0">a</text>
  </heading>

This tells us that:

  • The heading start marker extends from the first to the third byte of the original source file ("#  ")
  • The text inline it contains is made of the fourth byte of the original source file ("a")
  • The heading was suffixed by the fifth to the ninth byte of the original source file ("  ##\n"). This was interpreted as being part of the heading, but not parsed as a contained inline.
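
To make the reading of the three attributes concrete, here is a small sketch that slices the source bytes with the extents from the XML above. It only assumes that each attribute is a `start:end` pair of byte offsets with an exclusive end, as in the example; the helper names are mine:

```python
import xml.etree.ElementTree as ET

SOURCE = b"#  a  ##\n"
XML = """
<heading sourcepos="1:1-1:4" start-extents="0:3" extents="0:0" end-extents="4:9" level="1">
  <text start-extents="0:0" extents="3:4" end-extents="0:0">a</text>
</heading>
"""


def parse_extent(value):
    """Turn a 'start:end' attribute value into a pair of integers."""
    start, end = value.split(":")
    return int(start), int(end)


def slices(node, source):
    """Return the source bytes covered by each of the node's three extents."""
    result = {}
    for name in ("start-extents", "extents", "end-extents"):
        start, end = parse_extent(node.get(name))
        result[name] = source[start:end]
    return result


heading = ET.fromstring(XML.strip())
for node in heading.iter():
    print(node.tag, slices(node, SOURCE))
```

An extent of `0:0` slices to an empty byte string, so "no extent" needs no special casing here.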

Here we can see that thanks to this new information, a few things are now made possible:

  • Passthrough rendering:
    Assuming this block was part of a larger document, and we only updated a different part of the AST, we can now safely output the exact same set of characters, making it possible to consider tooling that neither enforces a certain style on the user, nor needlessly pollutes version control.
    It also means that an input such as # #a would not be needlessly escaped; in the current state the output would be # \#a

  • Smart escaping of content modified through the API:
    If the API user wants to modify the text of the title, we can now do so with context awareness, while still preserving the original formatting. For example, when using the set_title API, we could mark the node as "dirty", meaning the title should be rendered from the newly-set value, and at set-time only escape the relevant (trailing) hash symbols.

  • Accurate highlighting for potential visualizers:
    I think such tools already exist: given a markdown input and the resulting html rendered side by side, hovering over an element in the html highlights the source extents. With the current code, the trailing hashes do not get highlighted.

  • More far-fetched, zero-copy rendering:
    It would theoretically be possible, if the input data is guaranteed to stay in memory and not be deallocated, to render strings directly from that :)
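
As an illustration of the "smart escaping" point, here is a rough sketch of escaping only the hashes that would otherwise be read as an ATX closing sequence. `escape_heading_text` is a hypothetical helper, not a cmark API, and it assumes the CommonMark rule that a closing sequence is a trailing run of `#` preceded by a space or tab (or making up the whole content):

```python
import re

# A trailing run of '#' that starts the text or follows a space/tab,
# optionally followed by trailing whitespace.
_CLOSING_SEQUENCE = re.compile(r"(^|[ \t])(#+)[ \t]*$")


def escape_heading_text(text: str) -> str:
    """Backslash-escape the first '#' of a would-be closing sequence only."""
    match = _CLOSING_SEQUENCE.search(text)
    if match is None:
        return text  # nothing that could be mistaken for a closing sequence
    pos = match.start(2)
    return text[:pos] + "\\" + text[pos:]


for title in ["#a", "a #", "a#", "##"]:
    print(repr(title), "->", repr(escape_heading_text(title)))
```

With a rule like this, `#a` stays untouched while `a #` gets a single backslash, which is the context-aware behaviour described above.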

Let's look at a few other cases which are currently more or less well handled:

a **b* c

yields

  <paragraph sourcepos="1:1-1:8" start-extents="0:0" extents="0:0" end-extents="9:9">
    <text start-extents="0:0" extents="0:2" end-extents="0:0">a </text>
    <text start-extents="0:0" extents="2:3" end-extents="0:0">*</text>
    <emph start-extents="3:4" extents="0:0" end-extents="5:6">
      <text start-extents="0:0" extents="4:5" end-extents="0:0">b</text>
    </emph>
    <text start-extents="0:0" extents="6:8" end-extents="8:9"> c</text>
  </paragraph>

This is working well: here the start extents and end extents of the emphasis cover the * characters surrounding b.
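
For this example, passthrough rendering can be sketched as a plain tree walk that emits, for each node, its start extents, its own extents, its children, then its end extents. This is an illustration in Python against the XML above, not the author's renderer, and the emission order is my assumption:

```python
import xml.etree.ElementTree as ET

SOURCE = b"a **b* c\n"
XML = """
<paragraph sourcepos="1:1-1:8" start-extents="0:0" extents="0:0" end-extents="9:9">
  <text start-extents="0:0" extents="0:2" end-extents="0:0">a </text>
  <text start-extents="0:0" extents="2:3" end-extents="0:0">*</text>
  <emph start-extents="3:4" extents="0:0" end-extents="5:6">
    <text start-extents="0:0" extents="4:5" end-extents="0:0">b</text>
  </emph>
  <text start-extents="0:0" extents="6:8" end-extents="8:9"> c</text>
</paragraph>
"""


def span(node, name, source):
    """Slice the source with the 'start:end' byte offsets stored in attribute `name`."""
    start, end = (int(n) for n in node.get(name).split(":"))
    return source[start:end]


def passthrough(node, source):
    """Rebuild the original bytes for `node` from its extents and its children."""
    out = span(node, "start-extents", source) + span(node, "extents", source)
    for child in node:
        out += passthrough(child, source)
    return out + span(node, "end-extents", source)


root = ET.fromstring(XML.strip())
print(passthrough(root, SOURCE) == SOURCE)
```

The unmatched `*` and the trailing newline both come back because every source byte is owned by exactly one extent.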

[a](b)

yields

  <paragraph sourcepos="1:1-1:6" start-extents="0:0" extents="0:0" end-extents="7:7">
    <link start-extents="0:1" extents="0:0" end-extents="2:7" destination="b" title="">
      <text start-extents="0:0" extents="1:2" end-extents="0:0">a</text>
    </link>
  </paragraph>

That one is a bit problematic for updates. It does allow passthrough rendering, as the "non-visible" part of the link is correctly marked as being ](b)\n; however, I haven't yet decided how to split it up accurately so that the extents of the destination and of a potential label are available.

[a]: b

[a]

yields

  <paragraph sourcepos="3:1-3:3" start-extents="8:8" extents="0:0" end-extents="12:12">
    <link start-extents="8:9" extents="0:0" end-extents="10:12" destination="b" title="">
      <text start-extents="0:0" extents="9:10" end-extents="0:0">a</text>
    </link>
  </paragraph>

This one is also problematic, as the reference is not part of the AST, which is this time a problem for passthrough rendering. I'm interested in discussion on this as well.

a


b

yields

  <paragraph sourcepos="1:1-1:1" start-extents="0:0" extents="0:0" end-extents="2:4">
    <text start-extents="0:0" extents="0:1" end-extents="1:2">a</text>
  </paragraph>
  <paragraph sourcepos="4:1-4:1" start-extents="4:4" extents="0:0" end-extents="6:6">
    <text start-extents="0:0" extents="4:5" end-extents="5:6">b</text>
  </paragraph>

This case is in my opinion correctly handled, as it does allow preserving blank lines

@jgm, I hope that helps showing what I'm going for here :)

@MathieuDuponchelle
Contributor

MathieuDuponchelle commented Dec 2, 2016

I've pushed a few commits again, some of these examples are slightly obsolete (regarding ownership of line termination characters), however the design hasn't changed.

I've implemented a very simple passthrough renderer just to help validate the current code, and I'm pleased to say it now yields no diff at all when rendering alltests.md. I think that's a decent enough test case :D

The code for the passthrough renderer is at https://github.com/MathieuDuponchelle/cmark/blob/more-sourcepos/passthrough.c , it is not integrated at all with the build system, but it shouldn't be too hard to find out the correct compile command (gcc passthrough.c -L build/src/ -lcmark -I src/ -I build/src/ && LD_LIBRARY_PATH=build/src ./a.out alltests.md for example).

@jgm
Member

jgm commented Dec 2, 2016

How do you handle extents for code blocks, which are divided over multiple lines?

>     code
>     more code

Here the code block occupies a discontinuous extent (4 characters at the end of the first line and 9 at the end of the second line). Since there's just one node for this, what value does extent get?

(This was the kind of thing that led me to suggest, early in this giant thread here, that perhaps we should focus on creating a source map for the whole document, e.g. an array mapping each source position to the deepest corresponding node, rather than attaching source position information to each node.)
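
A rough sketch of what such a source map could look like for the block quote above, with one array entry per byte pointing at the deepest node. The spans are written by hand for illustration; nothing here is produced by cmark:

```python
SOURCE = b">     code\n>     more code\n"

first = SOURCE.index(b"code")
second = SOURCE.index(b"more code")
# Hypothetical (start, end, node) spans, listed from outermost to innermost
SPANS = [
    (0, len(SOURCE), "block_quote"),
    (first, first + len(b"code"), "code_block"),
    (second, second + len(b"more code"), "code_block"),
]


def build_source_map(length, spans):
    """Map every byte offset to the deepest node covering it."""
    source_map = [None] * length
    for start, end, node in spans:  # later (deeper) spans overwrite earlier ones
        for offset in range(start, end):
            source_map[offset] = node
    return source_map


def runs(source_map):
    """Collapse the map into (start, end, node) runs of identical entries."""
    result = []
    for offset, node in enumerate(source_map):
        if result and result[-1][2] == node:
            result[-1] = (result[-1][0], offset + 1, node)
        else:
            result.append((offset, offset + 1, node))
    return result


source_map = build_source_map(len(SOURCE), SPANS)
for run in runs(source_map):
    print(run)
```

The single code block shows up as two separate runs, so the discontinuity is represented without needing more than one extent per node.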

@MathieuDuponchelle
Contributor

Good question, and this kind of vertical discontinuity does break my approach. The argument for per-node extents is, I think, quite valid: as a user, I want to know the semantics behind these extents, e.g. I want to know whether the extents x:y associated with a link node are the link label, the link destination, or the punctuation symbols in between. Hmmmm I'll need to think a bit more it seems :)

@MathieuDuponchelle
Contributor

@jgm, I've started working on an alternate approach, with a per-parser extents list.

Given this input:

> a
> b

Here's a representation of the current state:

0:1 - block_quote (0x17ff5f0)
1:2 - block_quote (0x17ff5f0)
2:2 - paragraph (0x17ff740)
2:3 - text (0x17ff400)
3:4 - softbreak (0x17ff4d0)
4:5 - block_quote (0x17ff5f0)
5:6 - block_quote (0x17ff5f0)
6:6 - paragraph (0x17ff740)
6:7 - text (0x17ff910)

This does provide a level of interesting information; however, as I explained in the previous comment, I'm also interested in having the reverse mapping available, from node to extents, with semantic information, i.e. given a link node, what were the extents of its destination, for example.
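
The untagged part of that reverse mapping can already be derived from the list above. A small sketch that parses the dump as printed in this comment and groups the extents per node; the line format is taken from the dump, everything else is illustrative:

```python
import re
from collections import defaultdict

DUMP = """\
0:1 - block_quote (0x17ff5f0)
1:2 - block_quote (0x17ff5f0)
2:2 - paragraph (0x17ff740)
2:3 - text (0x17ff400)
3:4 - softbreak (0x17ff4d0)
4:5 - block_quote (0x17ff5f0)
5:6 - block_quote (0x17ff5f0)
6:6 - paragraph (0x17ff740)
6:7 - text (0x17ff910)
"""

LINE = re.compile(r"(\d+):(\d+) - (\w+) \((0x[0-9a-f]+)\)")


def reverse_mapping(dump):
    """Group (start, end) extents by (node type, node address)."""
    by_node = defaultdict(list)
    for line in dump.splitlines():
        match = LINE.fullmatch(line.strip())
        if match:
            start, end, kind, address = match.groups()
            by_node[(kind, address)].append((int(start), int(end)))
    return dict(by_node)


for node, extents in reverse_mapping(DUMP).items():
    print(node, extents)
```

What this cannot recover is the meaning of each extent (marker, content, destination), which is the semantic information asked for above.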

These extents are currently created in S_advance_offset, which takes a new parameter, the containing node. The function also now returns the new extent, and I figured the way to go would be to add new fields to individual node structures, in the case of a heading for example, we could set node->as.heading.marker_extents = S_advance_offset(...).

The other approach that would not require adding fields to the node structures would be to maintain a hashtable node -> list of extents, and add a tag attribute to the extent structure, but I figure that wouldn't be very elegant.
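
For comparison, the side-table variant might look roughly like this, sketched in Python rather than C. The `Extent`, `Node` and `ExtentTable` names and the tag strings are all hypothetical; the offsets reuse the heading example from earlier in the thread:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extent:
    start: int
    end: int
    tag: str  # hypothetical label such as "marker" or "closer"


class Node:
    """Stand-in for a parser node; it carries no extent fields itself."""

    def __init__(self, kind):
        self.kind = kind


class ExtentTable:
    """Side table mapping each node to its list of tagged extents."""

    def __init__(self):
        self._table = {}  # keyed by node object (identity-based hashing)

    def add(self, node, start, end, tag):
        self._table.setdefault(node, []).append(Extent(start, end, tag))

    def lookup(self, node, tag):
        return [e for e in self._table.get(node, []) if e.tag == tag]


heading = Node("heading")
table = ExtentTable()
table.add(heading, 0, 3, "marker")
table.add(heading, 4, 9, "closer")
print(table.lookup(heading, "marker"))
```

The node structures stay untouched, at the cost of a lookup and a string comparison for every query, which is the trade-off against dedicated fields such as marker_extents.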

I'm curious about your opinion there.

@nwellnhof
Contributor

This should be fixed now after merging #228.

@jgm jgm closed this as completed Nov 18, 2017
bc-lee pushed a commit to bc-lee/cmark that referenced this issue Jul 12, 2024
* [Bugfix] Fix the backticks bug

* Add test for inline backtick parse

SR-15415