Add round-trip fuzz test #55

mgeisler · 2023-05-27T17:17:17Z

This checks that the library can accurately go from Markdown to events and back to the same Markdown. I'm trying to use the library in google/mdbook-i18n-helpers#19.

My assumption was that I can round-trip Markdown through the library. While there is a lot of Markdown input which can produce a given sequence of pulldown_cmark::Events, I was hoping to normalize the output by running it through cmark. So I hoped that

    let round_trip_1 = round_trip(&text);
    let round_trip_2 = round_trip(&round_trip_1);
    assert_eq!(round_trip_1, round_trip_2);

would hold, there round_trip runs text through cmark(Parser::new(&text), &mut result), see round-trip.rs.

The fuzz test fails. To run it yourself, install cargo fuzz and run

cargo fuzz run round-trip -- -only_ascii=1

Input like "**" becomes "\\**" after one iteration and then becomes "\\*\\*" after a second round of normalization:

"**" -> [
  Start(Paragraph)
  Text(Borrowed("**"))
  End(Paragraph)
]

The events become "\\**", which the become these events:

"\\**" -> [
  Start(Paragraph)
  Text(Borrowed("*"))
  Text(Borrowed("*"))
  End(Paragraph)
]

These events are finally turned into "\\*\\*".

I can work around this case by merging adjacent Text events — but if I do that, I find a more complex input that fails the fuzz test.

Is there a chance to have the library support this kind of round-tripping?

Byron

Thanks for bringing this up! I agree that round-tripping would be a great property, but believe that the markdown parser in its current incarnation at least is inherently lossy in a couple of places. I am sure by now this has been improved, and maybe entirely rectified, but it would require some some research to be sure of that.

With that said, I hope there are more deterministic ways to add roundtrip-testing, maybe even in cases that are known to not round-trip correctly, instead of using fuzzer that will also run in CI thanks to make tests.

My disposition here is to not merge due to non-determinism that could also be fully deterministic, but wanted to hear your thoughts first. Thanks for sharing.

This checks that the library can accurately go from Markdown to events and back to the same Markdown.

mgeisler · 2023-05-27T22:03:22Z

Thanks for bringing this up! I agree that round-tripping would be a great property, but believe that the markdown parser in its current incarnation at least is inherently lossy in a couple of places.

I agree that there will be a lossy step here: a list item will result in Start(Item) being emitted regardless of what kind of "bullet" you used in the Markdown (* foo vs - foo, ...).

That is expected and I'm actually talking about round-tripping after such normalization. That is why my fuzzer does this:

    let round_trip_1 = round_trip(&text);
    let round_trip_2 = round_trip(&round_trip_1);
    assert_eq!(round_trip_1, round_trip_2);

The hope would be that round_trip_1 is normalized Markdown where all lists are formatted the same (using Options::list_token. It is only when round-tripping this normalized Markdown a second time that we can expect equality.

However, some brief testing showed me that this is not the case. I've tried round-tripping more and more times in the hope of finding a fixpoint. However, it seems that I'm just generating more elaborate test cases 😄

Here is an input which changes shape at least five times:

"~~\\~~~`"
-> "~~~~~\\`"
-> "\n`````\n````"
-> "\n````\n````\n````"
-> "\n````\n````\n\n````\n````"

I updated the PR to have this code in case it's interesting.

My disposition here is to not merge due to non-determinism that could also be fully deterministic, but wanted to hear your thoughts first. Thanks for sharing.

Yes, definitey! This PR is not ready to be merged, let me mark it as a WIP to make it more clear to others.

The fuzz test here is more of a future goal, and perhaps a way to find small low-hanging fruit which spoil the round-tripping today.

mgeisler · 2023-05-27T22:07:32Z

The corresponding events to the "~~\\~~~"` input:

"~~\\~~~`" -> [
  Start(Paragraph)
  Text(Borrowed("~~"))
  Text(Borrowed("~~~"))
  Text(Borrowed("`"))
  End(Paragraph)
]
"~~~~~\\`" -> [
  Start(CodeBlock(Fenced(Inlined(InlineStr { inner: [96, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], len: 1 }))))
  End(CodeBlock(Fenced(Inlined(InlineStr { inner: [96, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], len: 1 }))))
]
"\n`````\n````" -> [
  Start(CodeBlock(Fenced(Borrowed(""))))
  Text(Borrowed("````"))
  End(CodeBlock(Fenced(Borrowed(""))))
]
"\n````\n````\n````" -> [
  Start(CodeBlock(Fenced(Borrowed(""))))
  End(CodeBlock(Fenced(Borrowed(""))))
  Start(CodeBlock(Fenced(Borrowed(""))))
  End(CodeBlock(Fenced(Borrowed(""))))
]
"\n````\n````\n\n````\n````" -> [
  Start(CodeBlock(Fenced(Borrowed(""))))
  End(CodeBlock(Fenced(Borrowed(""))))
  Start(CodeBlock(Fenced(Borrowed(""))))
  End(CodeBlock(Fenced(Borrowed(""))))
]

It looks like a bug to me that Start(Paragraph) Text(Borrowed("~~")) ... End(Paragraph) is turned into something which is parsed as a Start(CodeBlock(..) next. I suppose the underlying problem here is that Options::code_block_token_count is a fixed integer — perhaps I could pick it conservatively based on the input? Something like counting the number of ~ or backticks int the input should help.

Byron · 2023-05-28T20:02:09Z

Thanks so much for elaborating! I understand now that after one round it could stabilize even if it isn't generally round-tripable. It would also be my expectation that issues with multi-step round-tripping are solely to be attributed to this crate rather than the parser.
I see great value in fixing this and appreciate your help with this. Maybe there could be a few tests that expose problems and make a fix feasible.

mgeisler · 2023-05-30T07:14:17Z

I see great value in fixing this and appreciate your help with this. Maybe there could be a few tests that expose problems and make a fix feasible.

Yeah, I'll add normal tests for individual cases in a new PR. I see this PR as a bit of a test-generating-machine 😄

Byron · 2023-05-30T09:01:07Z

The "using a fuzzer to generate test-cases for invariants" definitely goes into the tool-belt :). Thanks for pointing it out.

mgeisler · 2023-06-03T15:19:06Z

The "using a fuzzer to generate test-cases for invariants" definitely goes into the tool-belt :). Thanks for pointing it out.

Yeah, it's a new tool for me too! 😄 It's something I want to try applying more in various projects since it super powerful.

mgeisler force-pushed the fuzz branch from bc46514 to 581390c Compare May 27, 2023 17:19

Byron reviewed May 27, 2023

View reviewed changes

Add round-trip fuzz test

300fdfb

This checks that the library can accurately go from Markdown to events and back to the same Markdown.

mgeisler force-pushed the fuzz branch from 581390c to 300fdfb Compare May 27, 2023 22:02

mgeisler marked this pull request as draft May 27, 2023 22:03

mgeisler mentioned this pull request May 30, 2023

Normalize Markdown in .pot files google/mdbook-i18n-helpers#19

Closed

This was referenced Jun 3, 2023

Handle inline code spans with multiple backticks #56

Merged

Parsing inline code span with a single newline loses the newline pulldown-cmark/pulldown-cmark#655

Closed

mgeisler mentioned this pull request Dec 22, 2023

Is it possible to write back code block backtick count to original? #20

Open

kdarkhan mentioned this pull request May 27, 2024

Discussion: support round trips? pulldown-cmark/pulldown-cmark#892

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add round-trip fuzz test #55

Add round-trip fuzz test #55

mgeisler commented May 27, 2023

Byron left a comment

mgeisler commented May 27, 2023

mgeisler commented May 27, 2023

Byron commented May 28, 2023

mgeisler commented May 30, 2023

Byron commented May 30, 2023

mgeisler commented Jun 3, 2023

Add round-trip fuzz test #55

Are you sure you want to change the base?

Add round-trip fuzz test #55

Conversation

mgeisler commented May 27, 2023

Byron left a comment

Choose a reason for hiding this comment

mgeisler commented May 27, 2023

mgeisler commented May 27, 2023

Byron commented May 28, 2023

mgeisler commented May 30, 2023

Byron commented May 30, 2023

mgeisler commented Jun 3, 2023