Is it possible to write back code block backtick count to original? #20
Comments
I think this is closely related to #16 - maybe you can join the people who start talking to |
@Byron Do you mean that this is a matter for pulldown-cmark? |
This means you can try for yourself how pulldown-cmark maintains information about backticks. If it does indeed not maintain any, then there is nothing we can do here. |
I use a5f644a. I understand that the (BTW, why the result of
$git clone https://github.com/Byron/pulldown-cmark-to-cmark.git
Cloning into 'pulldown-cmark-to-cmark'...
remote: Enumerating objects: 11, done.
remote: Counting objects: 100% (11/11), done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 347 (delta 2), reused 7 (delta 2), pack-reused 336
Receiving objects: 100% (347/347), 86.72 KiB | 275.00 KiB/s, done.
Resolving deltas: 100% (200/200), done.
$cd markdown_img_url_editor_rust/
$cargo install pulldown-cmark
Updating crates.io index
Downloaded pulldown-cmark v0.7.2
Downloaded 1 crate (102.7 KB) in 1.21s
Installing pulldown-cmark v0.7.2
Downloaded unicase v2.6.0
Downloaded memchr v2.3.3
Downloaded unicode-width v0.1.8
Downloaded bitflags v1.2.1
Downloaded version_check v0.9.2
Downloaded getopts v0.2.21
Downloaded 6 crates (110.0 KB) in 0.97s
Compiling version_check v0.9.2
Compiling bitflags v1.2.1
Compiling memchr v2.3.3
Compiling pulldown-cmark v0.7.2
Compiling unicode-width v0.1.8
Compiling getopts v0.2.21
Compiling unicase v2.6.0
Finished release [optimized] target(s) in 33.27s
Installing /home/yumetodo/.cargo/bin/pulldown-cmark
Installed package `pulldown-cmark v0.7.2` (executable `pulldown-cmark`)
$cat sample.md
`````markdown
````markdown
```typescript
console.log("arikitari na sekai");
```
````
`````
$cat sample.md | pulldown-cmark -e
0..90: Start(CodeBlock(Fenced(Borrowed("markdown"))))
14..85: Text(Borrowed("````markdown\n```typescript\nconsole.log(\"arikitari na sekai\");\n```\n````\n"))
0..90: End(CodeBlock(Fenced(Borrowed("markdown"))))
EOF
$cargo run --example stupicat -- sample.md | pulldown-cmark
Finished dev [unoptimized + debuginfo] target(s) in 0.01s
Running `target/debug/examples/stupicat sample.md`
<pre><code class="language-markdown">````markdown
```typescript
console.log("arikitari na sekai");
```
</code></pre>
<pre><code></code></pre> |
I don't know either, but I don't entirely know what the examples are supposed to show. That is, I don't understand what's expected, and what you think should be done. As this issue is about backtick counts and it's clear that If the example above indicates another bug in |
OK. I created pulldown-cmark/pulldown-cmark#461, so I will wait for a response there. |
A pulldown-cmark collaborator says to count it yourself using
|
Fantastic. With this it is certainly possible to make the requested change. |
Of course, the opening and closing backtick counts of a code block are not always equal, as in the example below. And also, tilde( ```text
aaa
```` |
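To make the point about unequal counts concrete, here is a minimal sketch of the CommonMark rule for closing fences: the closing fence must use the same character as the opening fence and be at least as long, so a block opened with three backticks can be closed with four. The helper name `closes_fence` is hypothetical and not part of pulldown-cmark or this crate, and the sketch ignores container blocks such as block quotes.

```rust
/// Hypothetical helper: returns true if `line` closes a fenced code block
/// that was opened with `open_len` copies of `fence_char` ('`' or '~').
fn closes_fence(line: &str, fence_char: char, open_len: usize) -> bool {
    // Ignore a trailing line ending, if any.
    let line = line.trim_end_matches(|c| c == '\n' || c == '\r');
    // A closing fence may be indented by at most three spaces.
    let indent = line.chars().take_while(|&c| c == ' ').count();
    if indent > 3 {
        return false;
    }
    let rest = &line[indent..];
    // Count the run of fence characters at the start of the line.
    let run = rest.chars().take_while(|&c| c == fence_char).count();
    // Only spaces or tabs may follow the run (a closing fence has no info string).
    let tail = &rest[run * fence_char.len_utf8()..];
    run >= open_len && tail.chars().all(|c| c == ' ' || c == '\t')
}
```

A consequence for this issue is that the writer only needs a lower bound for the fence length; it does not have to reproduce the opening and closing counts separately to produce output that parses the same way.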
Won't this require changing the API? The |
Good point! I think it could be done by optionally passing the input buffer as part of the Options. There using the range information should be a breeze. |
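As a rough illustration of the "range and buffer" idea, the sketch below reads the opening fence directly from the original source text, given the byte range that the parser reported for a code block (such as the `0..90` range in the `pulldown-cmark -e` output earlier in this thread). The function name `opening_fence` is an assumption made for this example and is not an API of either crate. It also assumes that the range starts at the fence, possibly preceded by spaces.

```rust
use std::ops::Range;

/// Hypothetical helper: given the original Markdown `source` and the byte
/// `range` of a fenced code block, return the fence character and the
/// length of the opening fence.
fn opening_fence(source: &str, range: Range<usize>) -> Option<(char, usize)> {
    // `get` returns None instead of panicking when the range does not fit
    // the buffer, e.g. because the events no longer match the source.
    let block = source.get(range)?;
    // Skip optional indentation in front of the fence.
    let trimmed = block.trim_start_matches(' ');
    // A fence is made of backticks or tildes.
    let fence_char = trimmed.chars().next().filter(|c| *c == '`' || *c == '~')?;
    let len = trimmed.chars().take_while(|&c| c == fence_char).count();
    // Fewer than three fence characters do not open a fenced code block.
    if len >= 3 {
        Some((fence_char, len))
    } else {
        None
    }
}
```

Returning `None` for a range that does not fit the buffer catches only the most obvious mismatch between the event stream and the text; a range that is valid but points at unrelated text would still produce a wrong answer.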
In that case, it's better to get the diff between the event stream created from the original input text and the passed event stream... |
That sounds like a post-process entirely unrelated to this crate, or am I missing something? |
I'm concerned about whether the index information in the range is still valid when the passed event stream has been modified from the original. |
Have we considered dynamically setting the number of backticks based on the contents of the code block? That will require buffering those contents, but will avoid the need to pass in source text. The two main use cases I am thinking of are (a) mdbook preprocessors, and (b) Markdown code formatters. (I'm interested in the latter.) In either case there is no need to preserve the original input exactly – only to ensure that the result parses in the same way. My concern with the proposed solution is that the type system doesn't enforce that the given text and event stream correspond to each other. It's too easy to either mess up the ranges in the event stream, or pass in an unrelated string, both of which can lead to subtle bugs. |
Absolutely, the provided input buffer must match what's actually parsed, and it's something the programmer has to assure. That said, I would trust them to be able to do that or notice very quickly that something is wrong. In my understanding this crate should do as much if not all of the 'heavy lifting', but I can only acknowledge that it might be a different story when actually implementing it.
This sounds like a heuristic to me that is based on the observed code block? Maybe this could be elaborated so we are on the same page. Just to sum up what I got so far as options to solve this issue:
Something I would like to highlight in parting is a comment of @lambda-fairy as it helps me to see the value behind this issue:
This is something that ideally is easy because the input events are made so that they don't lose information, but it turns out that's not a goal of the input parser. This crate could be the remedy and I would love to find a general purpose solution to this issue. To me, option 1 is able to do it for sure and provides all information needed to apply more fixes in future, too. I would hope we could get to a point where any of the options really is implemented as proof of concept so we can eliminate those that might not work in practice and make progress towards resolving this issue. |
When I first created this issue, I was thinking of a scenario where pulldown-cmark parses the input, some of it is edited, and pulldown-cmark-to-cmark converts it back to Markdown (yumetodo/markdown_img_url_editor#40, written in Japanese). However, I now understand that this is too difficult to implement. For such a use case, we should use Now I agree to reduce the scope of this issue to the two main use cases @lambda-fairy describes. |
I have created a sketch, see #22, to show how this could work using the 'Range and buffer' method. This is the only way I would know how to do what's needed for this issue, and in future. Maybe those who have other ideas can sketch them and submit PRs so we all know what we are talking about. I am particularly interested in learning what @yumetodo thinks about the PR (please feel free to comment there) as this supports exactly what they have done. |
As another instance of this issue, I am writing an mdBook preprocessor using
Which correctly renders as:
However, this package will replace the double-backtick with a single backtick, producing:
Which incorrectly renders as:
|
Update: I implemented a fix in willcrichton/pulldown-cmark-to-cmark, although I am not confident in whether it satisfies all the possible configurations for this crate. |
That's fantastic, thanks a lot! I pulled the changes and saw a test was added for validation, and besides that, the spec-test is up 2 tests as well to 425/649 passed! I'd think that's a very measurable improvement worth a new release in case it helps. |
Hi @willcrichton, I believe the case of inline code with backticks has been fixed by #56.
@yumetodo, I'm just another user of the library, so this is just my opinion 😄 I would not expect the output to be equal after parsing and serializing back to Markdown since the Markdown AST might not preserve all information. However, I would expect it to be equal if you parse it again and serialize to Markdown. I did some experiments on this earlier this year in #55. That led me to find a few corner cases in this library and also in pulldown-cmark. |
I'm interested in this approach.
use pulldown_cmark::{Event, Tag};

fn check_code_block_token_count(group: &[(usize, Event)]) -> usize {
    let events = group.iter().map(|(_, event)| event);
    let mut in_codeblock = false;
    let mut max_token_count = 0;
    // token_count is carried over across Text events
    // because one continuous text may be split into several Text events.
    let mut token_count = 0;
    for event in events {
        match event {
            Event::Start(Tag::CodeBlock(_)) => {
                in_codeblock = true;
                token_count = 0;
            }
            Event::End(Tag::CodeBlock(_)) => {
                in_codeblock = false;
                token_count = 0;
            }
            Event::Text(x) if in_codeblock => {
                for c in x.chars() {
                    if c == '`' {
                        token_count += 1;
                    } else {
                        max_token_count = std::cmp::max(max_token_count, token_count);
                        token_count = 0;
                    }
                }
                max_token_count = std::cmp::max(max_token_count, token_count);
            }
            _ => token_count = 0,
        }
    }
    if max_token_count < 3 {
        // The default code block token is "```", which has a length of 3.
        3
    } else {
        // If there is "```" inside a code block, the code block token must be longer.
        max_token_count + 1
    }
}
I think code like the above may be useful if it is placed in this crate.
let code_block_token_count = pulldown_cmark_to_cmark::code_block_token_count(events);
let options = Options {
    code_block_token_count,
    ..Options::default()
};
pulldown_cmark_to_cmark::cmark_resume_with_options(events, &mut markdown, state, options);

let options = Options {
    code_block_token_count: None, // "None" means the appropriate count is calculated
                                  // in cmark_resume_with_options
    ..Options::default()
};
pulldown_cmark_to_cmark::cmark_resume_with_options(events, &mut markdown, state, options);
How about these ideas? |
Thanks for the writeup and for sharing! The way I understand this is that the count of backticks depends on the nesting level of code blocks? If so, the proposed code also wouldn't reproduce the exact count of backticks as there can be different nesting levels in the provided events? In any case, if you think having more support for this would help here, even if its scope is limited, then adding it to the crate should be useful. However, I think it should be done in a backward compatible way that also doesn't affect performance. Thus it sounds like option 1 (freestanding function) would be the way to go. |
In my understanding, fenced code blocks can't be nested. For example, I think the following code block is not a nested code block but a single code block which has
`````markdown
````markdown
```python
```
````
`````
My goal is not "reproducing the original backtick count" but "generating valid Markdown".
`````markdown
````markdown
```python
```
````
`````
````markdown
```rust
```
````
```c
```
`````markdown
````markdown
```python
```
````
`````
`````markdown
```rust
```
`````
`````c
`````
|
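The examples in this comment can be checked with a minimal sketch that works on the plain text of each code block instead of on pulldown-cmark events. It applies the same rule as the function posted earlier in the thread: take the longest run of backticks found inside any block, add one, and never go below three. Both function names are hypothetical and exist only for this illustration.

```rust
/// Longest run of consecutive backticks anywhere in `text`.
fn longest_backtick_run(text: &str) -> usize {
    let mut longest = 0;
    let mut current = 0;
    for c in text.chars() {
        if c == '`' {
            current += 1;
            longest = longest.max(current);
        } else {
            current = 0;
        }
    }
    longest
}

/// Hypothetical helper: one fence length that is safe for every code block
/// whose content is listed in `code_block_contents`.
fn shared_fence_len(code_block_contents: &[&str]) -> usize {
    let longest = code_block_contents
        .iter()
        .map(|text| longest_backtick_run(text))
        .max()
        .unwrap_or(0);
    // Three backticks is the minimum fence length; otherwise the fence must
    // be longer than any run of backticks inside a block.
    if longest < 3 {
        3
    } else {
        longest + 1
    }
}
```

For the three blocks in the example, the first block contains a run of four backticks, so a single shared count comes out as five, and every block is written with a five-backtick fence even though two of them would be valid with fewer. That is the trade-off of using one count for the whole document rather than one per block.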
Thanks for the clarification and the examples, I think we are on the same page now. Then it's a definitive "Yes," please submit a PR with a new free function that helps to set the correct backtick count based on a pre-scan of all known events. The |
Thank you for your suggestion. |
@Byron So even If we want to set backtick count to 3,
I can think of two solutions, but both are breaking changes.
let code_block_token_count = count_code_block_tokens(events, 3);
let code_block_token_count = count_code_block_tokens(events).unwrap_or(3); |
Alternatively, the default value can be changed to 3. If a user wants to use 4, the following code is available:
let code_block_token_count = count_code_block_tokens(events);
let code_block_token_count = std::cmp::max(code_block_token_count, 4);
This is not a breaking change. |
Thanks for bringing this up, and sorry for not catching it. I already yanked the previous release so the fix isn't breaking. My preference is to go with Thanks again for your help. |
Thank you. |
After #17 and #18, the code block backtick count is configurable. However, there is another problem.
Now consider a scenario like the one below:
ex.) https://github.com/yumetodo/markdown_img_url_editor_rust/blob/master/src/lib.rs#L74
In this case, there is nothing from which to reproduce the code block backtick count.
As a result, when the events from pulldown-cmark are simply passed through to pulldown-cmark-to-cmark, the input and output are currently not equal.