Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add round-trip fuzz test #55

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from
Draft

Add round-trip fuzz test #55

wants to merge 1 commit into from

Conversation

mgeisler
Copy link
Contributor

This checks that the library can accurately go from Markdown to events and back to the same Markdown. I'm trying to use the library in google/mdbook-i18n-helpers#19.

My assumption was that I can round-trip Markdown through the library. While there is a lot of Markdown input which can produce a given sequence of pulldown_cmark::Events, I was hoping to normalize the output by running it through cmark. So I hoped that

    let round_trip_1 = round_trip(&text);
    let round_trip_2 = round_trip(&round_trip_1);
    assert_eq!(round_trip_1, round_trip_2);

would hold, there round_trip runs text through cmark(Parser::new(&text), &mut result), see round-trip.rs.

The fuzz test fails. To run it yourself, install cargo fuzz and run

cargo fuzz run round-trip -- -only_ascii=1

Input like "**" becomes "\\**" after one iteration and then becomes "\\*\\*" after a second round of normalization:

"**" -> [
  Start(Paragraph)
  Text(Borrowed("**"))
  End(Paragraph)
]

The events become "\\**", which the become these events:

"\\**" -> [
  Start(Paragraph)
  Text(Borrowed("*"))
  Text(Borrowed("*"))
  End(Paragraph)
]

These events are finally turned into "\\*\\*".

I can work around this case by merging adjacent Text events — but if I do that, I find a more complex input that fails the fuzz test.

Is there a chance to have the library support this kind of round-tripping?

Copy link
Owner

@Byron Byron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for bringing this up! I agree that round-tripping would be a great property, but believe that the markdown parser in its current incarnation at least is inherently lossy in a couple of places. I am sure by now this has been improved, and maybe entirely rectified, but it would require some some research to be sure of that.

With that said, I hope there are more deterministic ways to add roundtrip-testing, maybe even in cases that are known to not round-trip correctly, instead of using fuzzer that will also run in CI thanks to make tests.

My disposition here is to not merge due to non-determinism that could also be fully deterministic, but wanted to hear your thoughts first. Thanks for sharing.

This checks that the library can accurately go from Markdown to events
and back to the same Markdown.
@mgeisler
Copy link
Contributor Author

Thanks for bringing this up! I agree that round-tripping would be a great property, but believe that the markdown parser in its current incarnation at least is inherently lossy in a couple of places.

I agree that there will be a lossy step here: a list item will result in Start(Item) being emitted regardless of what kind of "bullet" you used in the Markdown (* foo vs - foo, ...).

That is expected and I'm actually talking about round-tripping after such normalization. That is why my fuzzer does this:

    let round_trip_1 = round_trip(&text);
    let round_trip_2 = round_trip(&round_trip_1);
    assert_eq!(round_trip_1, round_trip_2);

The hope would be that round_trip_1 is normalized Markdown where all lists are formatted the same (using Options::list_token. It is only when round-tripping this normalized Markdown a second time that we can expect equality.

However, some brief testing showed me that this is not the case. I've tried round-tripping more and more times in the hope of finding a fixpoint. However, it seems that I'm just generating more elaborate test cases 😄

Here is an input which changes shape at least five times:

"~~\\~~~`"
-> "~~~~~\\`"
-> "\n`````\n````"
-> "\n````\n````\n````"
-> "\n````\n````\n\n````\n````"

I updated the PR to have this code in case it's interesting.

My disposition here is to not merge due to non-determinism that could also be fully deterministic, but wanted to hear your thoughts first. Thanks for sharing.

Yes, definitey! This PR is not ready to be merged, let me mark it as a WIP to make it more clear to others.

The fuzz test here is more of a future goal, and perhaps a way to find small low-hanging fruit which spoil the round-tripping today.

@mgeisler mgeisler marked this pull request as draft May 27, 2023 22:03
@mgeisler
Copy link
Contributor Author

The corresponding events to the "~~\\~~~"` input:

"~~\\~~~`" -> [
  Start(Paragraph)
  Text(Borrowed("~~"))
  Text(Borrowed("~~~"))
  Text(Borrowed("`"))
  End(Paragraph)
]
"~~~~~\\`" -> [
  Start(CodeBlock(Fenced(Inlined(InlineStr { inner: [96, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], len: 1 }))))
  End(CodeBlock(Fenced(Inlined(InlineStr { inner: [96, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], len: 1 }))))
]
"\n`````\n````" -> [
  Start(CodeBlock(Fenced(Borrowed(""))))
  Text(Borrowed("````"))
  End(CodeBlock(Fenced(Borrowed(""))))
]
"\n````\n````\n````" -> [
  Start(CodeBlock(Fenced(Borrowed(""))))
  End(CodeBlock(Fenced(Borrowed(""))))
  Start(CodeBlock(Fenced(Borrowed(""))))
  End(CodeBlock(Fenced(Borrowed(""))))
]
"\n````\n````\n\n````\n````" -> [
  Start(CodeBlock(Fenced(Borrowed(""))))
  End(CodeBlock(Fenced(Borrowed(""))))
  Start(CodeBlock(Fenced(Borrowed(""))))
  End(CodeBlock(Fenced(Borrowed(""))))
]

It looks like a bug to me that Start(Paragraph) Text(Borrowed("~~")) ... End(Paragraph) is turned into something which is parsed as a Start(CodeBlock(..) next. I suppose the underlying problem here is that Options::code_block_token_count is a fixed integer — perhaps I could pick it conservatively based on the input? Something like counting the number of ~ or backticks int the input should help.

@Byron
Copy link
Owner

Byron commented May 28, 2023

Thanks so much for elaborating! I understand now that after one round it could stabilize even if it isn't generally round-tripable. It would also be my expectation that issues with multi-step round-tripping are solely to be attributed to this crate rather than the parser.
I see great value in fixing this and appreciate your help with this. Maybe there could be a few tests that expose problems and make a fix feasible.

@mgeisler
Copy link
Contributor Author

I see great value in fixing this and appreciate your help with this. Maybe there could be a few tests that expose problems and make a fix feasible.

Yeah, I'll add normal tests for individual cases in a new PR. I see this PR as a bit of a test-generating-machine 😄

@Byron
Copy link
Owner

Byron commented May 30, 2023

The "using a fuzzer to generate test-cases for invariants" definitely goes into the tool-belt :). Thanks for pointing it out.

@mgeisler
Copy link
Contributor Author

mgeisler commented Jun 3, 2023

The "using a fuzzer to generate test-cases for invariants" definitely goes into the tool-belt :). Thanks for pointing it out.

Yeah, it's a new tool for me too! 😄 It's something I want to try applying more in various projects since it super powerful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants