Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is it possible to write back code block backtick count to original? #20

Open
yumetodo opened this issue Aug 4, 2020 · 31 comments
Open
Labels
good first issue Good for newcomers help wanted Extra attention is needed

Comments

@yumetodo
Copy link

yumetodo commented Aug 4, 2020

After #17, #18, code block backtick count is configurable. However, there is another problem.

Now let's consider such senario like below:

ex.) https://github.com/yumetodo/markdown_img_url_editor_rust/blob/master/src/lib.rs#L74

  1. Parse markdown file with pulldown-cmark
  2. Edit somewhare.
  3. Write back to markdown using pulldown-cmark-to-cmark

In this case, there is nothing to emulate block backtick count.

As a result, currently, when just bypass pulldown-cmark to pulldown-cmark-to-cmark, input and output is not equal.

@yumetodo yumetodo changed the title Is it possible to Is it possible to write back code block backtick count to original? Aug 4, 2020
@Byron
Copy link
Owner

Byron commented Aug 5, 2020

I think this is closely related to #16 - maybe you can join the people who start talking to pulldown-cmark to make not degenerating information a priority.
Otherwise there is no way to tell how many backticks were parsed.

@yumetodo
Copy link
Author

yumetodo commented Aug 5, 2020

@Byron Do you mean that this is pulldown-cmark's matter?
What information is lacked after passing the pulldown-cmark event system to resolve this issue?
Without the information, I cannot create an issue at the pulldown-cmark.

@Byron
Copy link
Owner

Byron commented Aug 6, 2020

This means you can try for yourself how pulldown-cmark maintains information about backticks. If it does indeed not maintain any, then there is nothing we can do here.

@yumetodo
Copy link
Author

yumetodo commented Aug 8, 2020

I use a5f644a . I understand that the Fenced event doesn't provide backtick count.

(BTW, why the result of cargo run --example stupicat -- sample.md | pulldown-cmark is broken? Where is the last ````?

$git clone https://github.com/Byron/pulldown-cmark-to-cmark.git
Cloning into 'pulldown-cmark-to-cmark'...

remote: Enumerating objects: 11, done.
remote: Counting objects: 100% (11/11), done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 347 (delta 2), reused 7 (delta 2), pack-reused 336
Receiving objects: 100% (347/347), 86.72 KiB | 275.00 KiB/s, done.
Resolving deltas: 100% (200/200), done.

$cd markdown_img_url_editor_rust/

$cargo install pulldown-cmark
    Updating crates.io index
  Downloaded pulldown-cmark v0.7.2
  Downloaded 1 crate (102.7 KB) in 1.21s
  Installing pulldown-cmark v0.7.2
  Downloaded unicase v2.6.0
  Downloaded memchr v2.3.3
  Downloaded unicode-width v0.1.8
  Downloaded bitflags v1.2.1
  Downloaded version_check v0.9.2
  Downloaded getopts v0.2.21
  Downloaded 6 crates (110.0 KB) in 0.97s
   Compiling version_check v0.9.2
   Compiling bitflags v1.2.1
   Compiling memchr v2.3.3
   Compiling pulldown-cmark v0.7.2
   Compiling unicode-width v0.1.8

   Compiling getopts v0.2.21
   Compiling unicase v2.6.0
    Finished release [optimized] target(s) in 33.27s
  Installing /home/yumetodo/.cargo/bin/pulldown-cmark
   Installed package `pulldown-cmark v0.7.2` (executable `pulldown-cmark`)

$cat sample.md
`````markdown
````markdown
```typescript
console.log("arikitari na sekai");
```
````
`````

$cat sample.md | pulldown-cmark -e
0..90: Start(CodeBlock(Fenced(Borrowed("markdown"))))
14..85: Text(Borrowed("````markdown\n```typescript\nconsole.log(\"arikitari na sekai\");\n```\n````\n"))
0..90: End(CodeBlock(Fenced(Borrowed("markdown"))))
EOF

$cargo run --example stupicat -- sample.md | pulldown-cmark
    Finished dev [unoptimized + debuginfo] target(s) in 0.01s
     Running `target/debug/examples/stupicat sample.md`
<pre><code class="language-markdown">````markdown
```typescript
console.log(&quot;arikitari na sekai&quot;);
```
</code></pre>
<pre><code></code></pre>

@Byron
Copy link
Owner

Byron commented Aug 8, 2020

(BTW, why the result of cargo run --example stupicat -- sample.md | pulldown-cmark is broken? Where is the last ````?

I don't know either, but I don't entirely know what the examples are supposed to show. That is, I don't understand what's expected, and what you think should be done.

As this issue is about backtick counts and it's clear that pulldown-cmark doesn't keep them, I would conclude there is nothing we can do here.

If the example above indicates another bug in pulldown-cmark-to-cmark, I think this should move to a new issue if there is interest in getting it fixed.

@yumetodo
Copy link
Author

yumetodo commented Aug 8, 2020

OK. I created pulldown-cmark/pulldown-cmark#461 so that I will wait that responce.

@yumetodo
Copy link
Author

pulldown-cmark's collaborator says that count it by yourself using Event::Start's range information.

if let (Event::Start(Tag::CodeBlock(CodeBlockKind::Fenced(..))), range) = event {
    let num_backticks = source_text[range].chars().take_while(|c| c == '`').count();
}

@Byron Byron added good first issue Good for newcomers help wanted Extra attention is needed labels Sep 1, 2020
@Byron
Copy link
Owner

Byron commented Sep 1, 2020

Fantastic. With this it is certainly possible to make the requested change.
I think it should be quite straightforward.

@yumetodo
Copy link
Author

yumetodo commented Sep 2, 2020

Of course, they are not always equal between begin count and end count of code block backtick like below. And also, tilde(~) is allowed as a code block fence char.

```text
aaa
````

@lambda-fairy
Copy link

Won't this require changing the API? The cmark function currently has no access to the original source text.

@Byron
Copy link
Owner

Byron commented Jan 18, 2021

Good point! I think it could be done by optionally passing the input buffer as part of the Options. There using the range information should be a breeze.

@yumetodo
Copy link
Author

In that case, it's better to get the diff between the event stream created from the original input text and the passed event stream...

@Byron
Copy link
Owner

Byron commented Jan 18, 2021

That sounds like a post-process entirely unrelated to this crate, or am I missing something?
Maybe knowing more about this would help to understand why you would discard a proposal which to my mind would allow using ranges as suggested by a pulldown-cmark maintainer.

@yumetodo
Copy link
Author

I'm concern that the index information in the range is valid when the passed event stream is modified from the original.
pulldown-cmark maintainer's suggestion code requests source text that corresponds one-to-one. That is what @lambda-fairy pointed out, as far as I understand.

@lambda-fairy
Copy link

Have we considered dynamically setting the number of backticks based on the contents of the code block? That will require buffering those contents, but will avoid the need to pass in source text.

The two main use cases I am thinking of are (a) mdbook preprocessors, and (b) Markdown code formatters. (I'm interested in the latter.) In either case there is no need to preserve the original input exactly – only to ensure that the result parses in the same way.

My concern with the proposed solution is that the type system doesn't enforce that the given text and event stream correspond to each other. It's too easy to either mess up the ranges in the event stream, or pass in an unrelated string, both of which can lead to subtle bugs.

@Byron
Copy link
Owner

Byron commented Jan 19, 2021

@yumetodo

I'm concern that the index information in the range is valid when the passed event stream is modified from the original.
pulldown-cmark maintainer's suggestion code requests source text that corresponds one-to-one.

@lambda-fairy

My concern with the proposed solution is that the type system doesn't enforce that the given text and event stream correspond to each other. It's too easy to either mess up the ranges in the event stream, or pass in an unrelated string, both of which can lead to subtle bugs.

Absolutely, the provided input buffer must match what's actually parsed, and it's something the programmer has to assure. That said, I would entrust then to be able to do that or notice very quickly that something is wrong. In my understanding this crate should do as much if not all of the 'heavy lifting', but I can only acknowledge that it might be a different story when actually implementing it.

Have we considered dynamically setting the number of backticks based on the contents of the code block? That will require buffering those contents, but will avoid the need to pass in source text.

This sounds like a heuristic to me that is based on the observed code block? Maybe this could be elaborated so we are on the same page.

Just to sum up what I got so far as options to solve it this issue:

  1. pass in a source buffer and use the provided ranges to read the portions of the source we are interested in
  2. diff source and output, or more generally, apply a post process (probably to be implemented in a different crate, can be part of this repo though)
  3. Set the number of backticks based on the contents of the code block (needs elaboration)

Something I would like to highlight in parting is a comment of @lambda-fairy as it helps me to see the value behind this issue:

In either case there is no need to preserve the original input exactly – only to ensure that the result parses in the same way.

This is something that ideally is easy because the input events are made so that they don't degenerate information, but it turns out that's not a goal of the input parser. This crate could be the remedy and I would love to find a general purpose solution to this issue. To me, option 1 is able to do it for sure and provides all information needed to apply more fixes in future, too.

I would hope we could get to a point where any of the options really is implemented as proof of concept so we can eliminate those who might not work in practice and make progress towards resolving this issue.

@yumetodo
Copy link
Author

yumetodo commented Jan 19, 2021

When I first created this issue, I was thinking of a scenario where pulldown-cmark would parse it, edit some of it, and pulldown-cmark-to-cmark would convert it back to markdown(yumetodo/markdown_img_url_editor#40 (Written in Japanese)).

However, now I got it that is too difficult to implement. For such a use case, we should use Parser::into_offset_iter() and not by this issue.

Now I agree to reduce the scope of this issue to the two main use cases @lambda-fairy says.

@Byron Byron mentioned this issue Jan 21, 2021
@Byron
Copy link
Owner

Byron commented Jan 21, 2021

I have created a sketch, see #22, to show how this could work using the 'Range and buffer' method. This is the only way I would know how to do what's needed for this issue, and in future.

Maybe those who have other ideas can sketch them and submit PRs so we all know what we are talking about.

I am particularly interested in learning what @yumetodo thinks about the PR (please feel free to comment there) as this supports exactly what they have done.

@willcrichton
Copy link
Contributor

willcrichton commented Jun 24, 2022

As another instance of this issue, I am writing an mdBook preprocessor using pulldown-cmark-to-cmark for rust-lang/book. The Rust book uses double-backticks for inline code to escape single-backticks, such as:

here we embed ``an error `message` <- with a backtick``

Which correctly renders as:

here we embed an error `message` <- with a backtick

However, this package will replace the double-backtick with a single backtick, producing:

here we embed `an error `message` <- with a backtick`

Which incorrectly renders as:

here we embed an error message <- with a backtick

@willcrichton
Copy link
Contributor

Update: I implemented a fix in willcrichton/pulldown-cmark-to-cmark, although I am not confident in whether it satisfies all the possible configurations for this crate.

@Byron
Copy link
Owner

Byron commented Jun 28, 2022

Update: I implemented a fix in willcrichton/pulldown-cmark-to-cmark, although I am not confident in whether it satisfies all the possible configurations for this crate.

That's fantastic, thanks a lot!

I pulled the changes and saw a test was added for validation, and besides that, the spec-test is up 2 tests as well to 425/649 passed! I'd think that's a very measurable improvement worth a new release in case it helps.

@mgeisler
Copy link
Contributor

Update: I implemented a fix in willcrichton/pulldown-cmark-to-cmark, although I am not confident in whether it satisfies all the possible configurations for this crate.

Hi @willcrichton, I believe the case of inline code with backticks has been fixed by #56.

As a result, currently, when just bypass pulldown-cmark to pulldown-cmark-to-cmark, input and output is not equal.

@yumetodo, I'm just another user of the library — so this is just my opinion 😄

I would not expect the output to be equal after parsing and serializing back to Markdown since the Markdown AST might not preserve all information. However, I would expect it to be equal if you parse it again and serialize to Markdown.

I did some experiments on this earlier this year in #55. That lead me to find a few corner cases in this library and also in pulldown-cmark.

@dalance
Copy link
Contributor

dalance commented Jan 24, 2024

  1. Set the number of backticks based on the contents of the code block (needs elaboration)

I'm interesting in this approach.
In google/mdbook-i18n-helpers#129, I wrote the following code to count backticks in code block:

fn check_code_block_token_count(group: &[(usize, Event)]) -> usize {
    let events = group.iter().map(|(_, event)| event);
    let mut in_codeblock = false;
    let mut max_token_count = 0;

    // token_count should be taken over Text events
    // because a continuous text may be splitted to some Text events.
    let mut token_count = 0;

    for event in events {
        match event {
            Event::Start(Tag::CodeBlock(_)) => {
                in_codeblock = true;
                token_count = 0;
            }
            Event::End(Tag::CodeBlock(_)) => {
                in_codeblock = false;
                token_count = 0;
            }
            Event::Text(x) if in_codeblock => {
                for c in x.chars() {
                    if c == '`' {
                        token_count += 1;
                    } else {
                        max_token_count = std::cmp::max(max_token_count, token_count);
                        token_count = 0;
                    }
                }
                max_token_count = std::cmp::max(max_token_count, token_count);
            }
            _ => token_count = 0,
        }
    }
    if max_token_count < 3 {
        // default code block token is "```" which is 3
        3
    } else {
        // If there is "```" in codeblock, codeblock token should be extended.
        max_token_count + 1
    }
}

I think a code like the above may be useful if it is placed in this crate.
For example, the following designs can be considered.

  • Add code_block_token_count as a free standing function which can be used like below
let code_block_token_count = pulldown_cmark_to_cmark::code_block_token_count(events);
let options = Options {
    code_block_token_count,
    ..Options::default()
};
pulldown_cmark_to_cmark::cmark_resume_with_options(events, &mut markdown, state, options);
  • Add a token counting code which is used when an explicit token count is not provided to cmark* function
let options = Options {
    code_block_token_count: None, // "None" means the appropriate count is calcurated
                                  // in cmark_resume_with_options
    ..Options::default()
};
pulldown_cmark_to_cmark::cmark_resume_with_options(events, &mut markdown, state, options);

How about these ideas?

@Byron
Copy link
Owner

Byron commented Jan 24, 2024

Thanks for the writeup and for sharing!

The way I understand this is that the count of backticks depends on the nesting level of code blocks? If so, the proposed code also wouldn't reproduce the exact count of backticks as there can be different nesting levels in the provided events?

In any case, if you think having more support for this would help here, even if its scope is limited, then adding it to the crate should be useful. However, I think it should be done in a backward compatible way that also doesn't affect performance. Thus is sounds like option 1 (freestanding function) would be the way to go.

@dalance
Copy link
Contributor

dalance commented Jan 24, 2024

The way I understand this is that the count of backticks depends on the nesting level of code blocks? If so, the proposed code also wouldn't reproduce the exact count of backticks as there can be different nesting levels in the provided events?

In my understanding, fenced code block can't be nested.
(Refer: https://spec.commonmark.org/0.30/#fenced-code-blocks)

For example, I think the following code block is not a nested code block but a single code block which have "````markdown\n```python\n```\n````" as a plane text.
Therefore, an appropriate token count can be judged by counting continuous backticks in the plane text.

`````markdown
````markdown
```python
```
````
`````

My goal is not "reproducing the original backtick count" but "generating a valid markdown".
My code counts the longest backticks in the events, so some of the generated backticks may be too long, but it is a valid markdown.

  • Original markdown
`````markdown
````markdown
```python
```
````
`````

````markdown
```rust
```
````

```c
```
  • Generated markdown by cmark and code_block_token_count function (In this case, code_block_token_count will be 5)
`````markdown
````markdown
```python
```
````
`````

`````markdown
```rust
```
`````

`````c
`````

@Byron
Copy link
Owner

Byron commented Jan 24, 2024

Thanks for the clarification and the examples, I think we are on the same page now.

Then it's a definitive "Yes," please submit a PR with a new free function that helps to set the correct backtick count based on a pre-scan of all known events.

The code_block_token_count (which is confusingly named given that it is about backticks in code-blocks) documentation could also be adjusted to refer to the new function, whose own documentation could make clear what it does (maybe even by reusing some of the explanation used in this thread).

@dalance
Copy link
Contributor

dalance commented Jan 24, 2024

Thank you for your suggestion.
I've opened #65.

@dalance
Copy link
Contributor

dalance commented Jan 25, 2024

@Byron
Sorry. I missed the problem of API design.
The API of #65 doesn't have the way to specify default backticks count.
In the current impl, it refers the hard-coded DEFAULT_CODE_BLOCK_TOKEN_COUNT.

So even If we want to set backtick count to 3, count_code_block_tokens surely returns 4 or more than 4.
In other word, when the function returns 4, we can't identify the following two cases.

  • There is no backtick in code block, so codeblock_token_count can be set to 3.
  • There is 3 consective backticks in code block, so codeblock_token_count can't be set to 3, it must be 4 or more than.

I think two solutions, but both are breaking change.

  • Pass default count to count_code_block_tokens
let code_block_token_count = count_code_block_tokens(events, 3);
  • Return Option<usize> from count_code_block_tokens
let code_block_token_count = count_code_block_tokens(events).unwrap_or(3);

@dalance
Copy link
Contributor

dalance commented Jan 25, 2024

Alternatively the default value can be changed to 3.
3 is the minimum value allowed in spec.

lf user want to use 4, the follow code is available.

let code_block_token_count = count_code_block_tokens(events);
let code_block_token_count = std::cmp::max(code_block_token_count, 4);

This is not breaking change.

@Byron
Copy link
Owner

Byron commented Jan 26, 2024

Thanks for bringing this up, and sorry for not catching it. I already yanked the previous release so the fix isn't breaking.

My preference is to go with Option<usize> as return value, it seems most idiomatic and flexible.

Thanks again for your help.

@dalance
Copy link
Contributor

dalance commented Jan 27, 2024

Thank you.
I've opened #66.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good for newcomers help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

6 participants