-
-
Notifications
You must be signed in to change notification settings - Fork 552
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
More sourcepos? #26
Comments
Currently we have source position (start and end line and column) for block-level elements, but not for inline-level elements. Storing source position for inline elements would require making a few parts of the parser more complex, and it might have an efficiency cost, but it could certainly be done. Indeed, it is already done (if I'm not mistaken) in Knagis/CommonMark.NET and perhaps in other implementations too. |
Hey @jgm, as I said in #131 I've started experimenting with offsets, see https://github.com/MathieuDuponchelle/cmark/commits/more-sourcepos This branch is incomplete, but I think it's a good first step in the right direction, pretty simple and already showing some results, given this input file:
The code in that branch doesn't get thrown off and reports correct extents for the inlines near the end. Current approach is to report absolute offsets from the start of the file as bytes, unicode handling is on the client, this simple python 3 script helps validating the ranges:
I didn't add any API yet, but modified the xml output to report these offsets, and disabled the only test that checks this renderer, in api_test. The only intrusive commit with respect to actual parsing is MathieuDuponchelle@c6b706a : we want to parse inlines from unstripped lines, to have accurate positioning. This means I had to to add space skipping in a few strategic places to fix regressions. I did say the patch was incomplete, we will want to handle emphasis insertions, links and a few other things, but I'm now pretty confident we can get accurate source positions for all elements in the AST in a simple and minimally intrusive manner. Anyway, I won't start work on "passthrough" commonmark rendering before we have this, @jgm waiting for your initial comments, a priori the complexity of this is way lower than that of an extension API so I'm hopeful I can actually get something in this time :P |
Pushed a few more commits, handling emphasis and inline links, still need to find:
we have:
That works, but could be more helpful |
To help me understand this, can you give a little narrative
about what these various extent properties on nodes represent,
with some examples?
|
Disclaimer: that is experimental, based on your suggestion of having multiple pairs of offsets. I'm coming at this from the "passthrough" perspective, which is my end goal. To illustrate this, let's take a simple block as an example:
With the current code, we only have the information that the heading block extends from the first column of the first line to the fourth column of the first line (1:1-1:4) We don't have any information about the text inline it contains. With my current approach, here's what we now know:
This tells us that:
Here we can see that thanks to this new information, a few things are now made possible:
Let's look at a few other cases which are currently more or less well handled:
yields
This is working well, here start extents and end extents of the emphasis cover the
yields
That one is a bit problematic for update, it does allow passthrough rendering as the "non-visible" part of the link is correctly marked as being
yields
This one is also problematic, as the reference is not part of the AST, which is this time a problem for passthrough rendering. I'm interested in discussion on this as well.
yields
This case is in my opinion correctly handled, as it does allow preserving blank lines @jgm, I hope that helps showing what I'm going for here :) |
I've pushed a few commits again, some of these examples are slightly obsolete (regarding ownership of line termination characters), however the design hasn't changed. I've implemented a very simple passthrough renderer just to help validating the current code, and I'm pleased to say it now yields no diff at all when rendering alltests.md , I think that's a decent enough test case :D The code for the passthrough renderer is at https://github.com/MathieuDuponchelle/cmark/blob/more-sourcepos/passthrough.c , it is not integrated at all with the build system, but it shouldn't be too hard to find out the correct compile command ( |
How do you handle extents for code blocks, which are divided over multiple lines?
Here the code block occupies a discontinuous extent (4 characters at the end of the first line and 9 at the end of the second line). Since there's just one node for this, what value does (This was the kind of thing that led me to suggest, |
Good question, and this kind of vertical discontinuity does break my approach. The argument for per-node extents is I think quite valid: as a user, I want to know about the semantic behind these extents, e.g. I want to know that the extents x:y associated to a link node is the link label, or the link destination, or the punctuation symbols in between. Hmmmm I'll need to think a bit more it seems :) |
@jgm, I've started working on an alternate approach, with a per-parser extents list. Given this input:
Here's the a representation of the current state:
This does provide a level of interesting information, however as I explained in the previous comment, I'm also interested in having the reverse mapping be available too, from node to extents, with semantic information, ie given a link node, what were the extents of its destination for example. These extents are currently created in S_advance_offset, which takes a new parameter, the containing node. The function also now returns the new extent, and I figured the way to go would be to add new fields to individual node structures, in the case of a heading for example, we could set The other approach that would not require adding fields to the node structures would be to maintain a hashtable node -> list of extents, and add a tag attribute to the extent structure, but I figure that wouldn't be very elegant. I'm curious about your opinion there ? |
This should be fixed now after merging #228. |
* [Bugfix] Fix the backticks bug * Add test for inline backtick parse SR-15415
It would be nice if there is more source position in the output, to the point where it is possible to track the source of every bit of text.
It seems to me that currently, this tool is overall very limited (eg, doesn't come with a huge number of extensions and command line arguments to switch them on and off). This looks like a good idea for something that is supposed to serve as a basis for some system that will extend it. In my case, I played with the idea of embedding it with a different markup system, which would result in having the best of both (convenience of cmark, flexibility of a markup when needed). In any case, one way to get something like that going without implementing yet another markdown (or commonmark) parser is to use an existing tool like cmark. But unfortunately this requires knowing exactly where each bit of text came from, not just the current per line thing (?). This is because my target system is a proper language, so source tracking is important for syntax errors etc.
Ideally, this could be done for
--smart
replacement text too...The text was updated successfully, but these errors were encountered: