Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Unicode class arbitrary #991

Closed
wants to merge 20 commits into from
Closed

Implement Unicode class arbitrary #991

wants to merge 20 commits into from

Conversation

addisoncrump
Copy link
Contributor

See late discussion in #848

BurntSushi and others added 19 commits May 4, 2023 21:16
This effectively copies my regex-automata work into this crate and does
a bunch of rejiggering to make it work. In particular, we wire up its
new test harness to the public regex crate API. In this commit, that
means the regex crate API is being simultaneously tested using both the
old and new test suites.

This does *not* get rid of the old regex crate implementation. That will
happen in a subsequent commit. This is just a staging commit to prepare
for that.
If we need this again, we should just rewrite it in Rust and put it in
'regex-cli'.
All of the old tests should be covered by either porting them over
explicitly, or in the TOML test suite.
We're going to drop the old benchmark suite in favor of rebar, but it's
worth recording some final results. This ensures we get a fair
comparison with the regex crate before and after its internals have been
rewritten.
We are going to remove the old benchmark harness, but it seems like a
good idea to save the old measurements.

In the future, benchmarks will be maintained by rebar:
https://github.com/BurntSushi/rebar
As stated in a previous commit, we'll be moving to rebar. (rebar isn't
actually published at time of writing, but it's essentially ready to
go.)
It's still not as good as it could be, but we add fuzz targets for
regex-lite and DFA deserialization in regex-automata.
This feature makes all of the AST types derive the 'Arbitrary' trait,
which is in turn quite useful for fuzz testing.
This makes a couple of the fuzzer targets a bit nicer by just asking for
structured data instead of trying to manifest it ourselves out of a
&[u8].

Closes #821
The fuzzer sometimes runs into situations where it builds regexes that
can take a while to execute, such as `\B{10000}`. They fit within the
default size limit, but the search times aren't great. But it's not a
bug. So try to decrease the size limit a bit to try and prevent
timeouts.

We might consider trying to optimize cases like `\B{10000}`. A naive
optimization would be to remove any redundant conditional epsilon
transitions within a single epsilon closure, but that can be tricky to
do a priori. The case of `\B{100000}` is probably easy to detect, but
they can be arbitrarily complex.

Another way to attack this would be to modify, say, the PikeVM to only
compute whether a conditional epsilon transition should be followed once
per haystack position. Right now, I think it is re-computing them even
though it doesn't have to.
This makes uses of the new 'arbitrary' feature in 'regex-syntax' to make
fuzzing much more targeted and complete.

Closes #848
It turns out that the way we were dealing with quit states in the DFA
was not quite right. Basically, if we entered a quit state and a match
had been found, then we were returning the match instead of the error.
But the match might not be the correct leftmost-first match, and so, we
really shouldn't return it. Otherwise a regex like '\B.*' could match
much less than it should.

This was caught by a differential fuzzer developed in #848.
This adds a regression test for a bug found in the *old* regex crate
that isn't present with the regex-automata rewrite. I discovered this
while doing differential fuzzing. I didn't do a root cause analysis of
the bug, but my guess is a literal optimization problem.
@addisoncrump
Copy link
Contributor Author

I don't think this is quite complete... missing e.g. Changes_When_Casefolded

@addisoncrump
Copy link
Contributor Author

I attempted to optimise this further by removing the recomputation of the full unicode table sizes every time the arbitrary was used, but this seemed to actually worsen things, so this is probably Good Enough™. There is potentially a further optimisation by "skipping" members of the unicode table (property values in particular) but I don't think this is worthwhile.

@BurntSushi BurntSushi force-pushed the ag/regex-automata branch from 144c684 to 8f4af0b Compare May 14, 2023 16:24
@addisoncrump
Copy link
Contributor Author

Well, the force-push broke the cleanness of this a bit 😂 Would you let me know if you see any issues with the implementation I have so far? I'll port it over some point later.

@BurntSushi
Copy link
Member

BurntSushi commented May 16, 2023

Yeah no worries I'll fix that. I've taken a brief break from fuzzing in favor of trying to complete the migration. (I have a lot of prose writing to do.) But I'll circle back around to this before regex 1.9.

I've also been interrupted a bit by historically bad seasonal allergies and lovely distractions such as Tears of the Kingdom. :)

@BurntSushi BurntSushi force-pushed the ag/regex-automata branch 5 times, most recently from edf2fe6 to b2a0c9f Compare May 18, 2023 15:39
path = ".."
arbitrary = { version = "1.3.0", features = ["derive"] }
libfuzzer-sys = { version = "0.4.1", features = ["arbitrary-derive"] }
regex = { path = "..", default-features = false, features = ["std"] }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@addisoncrump Out of curiosity, why did you setup the features this way? Shrinking regex down to just std removes a lot of stuff that is worth testing...

@BurntSushi
Copy link
Member

So I tried this out, but I'm tempted to remove the ast_roundtrip fuzz target. It is producing waaaaay too many false positives and I just don't have the time to deal with them right now. At a higher level, the functionality that ast_roundtrip is testing is not particularly critical. It's a "nice to have" that isn't part of how the regex engines work.

I'll take this PR in for now (I cherry picked the commits), but I'm likely to remove the ast_roundtrip fuzzer. We can revisit it later.

Also probably a good idea not to submit more PRs to ag/regex-automata unfortunately. I'm still force pushing to it. I don't want to get it on to master yet.

@BurntSushi BurntSushi closed this May 18, 2023
@BurntSushi
Copy link
Member

Ah okay, the false positives were a result of a flub on my part. I put env_logger::init() into each of the fuzzers, and that started panicking when I ran the fuzzer across multiple threads. Since ast_roundtrip was the first fuzzer I ran in a multi-threaded way since adding env_logger::init(), I falsely assumed it had a bunch of false positives. I'm running it now and all seems good for now.

@BurntSushi
Copy link
Member

Ah yeah I see a problem. The Arbitrary impls are generating capture groups with invalid names:

Group(Group { span: Span(Position(o: 72056936891165757, l: 18446743197536222976, c: 18095463302774652928), Position(o: 47261417342, l: 9187201950444158975, c: 9187201950435737471)), kind: CaptureName { starts_with_p: true, name: CaptureName { span: Span(Position(o: 7731356377379602303, l: 15191295558507746002, c: 67961541330), Position(o: 5425479542451994624, l: 18446743297583696715, c: 5425513257945268223)), name: "KkKK>\0\0\u{b}\u{b}\u{b}\u{b}\u{b}\u{b}\u{b}?<\u{b}\u{b}\u{b}\u{b}\u{b}{\u{b}\u{b}\u{b}\u{b}\u{b}\u{b}\u{b}\u{b}\u{b}555555}}}}}}}}}eeee\u{1}\0\0\0\0\0\0,", index: 1263206624 } }, ast: Empty(Span(Position(o: 0, l: 0, c: 0), Position(o: 0, l: 0, c: 0))) })

The AST printer assumes that the AST values are valid, including the capture group name, so it dumps it out naively. But there's a > in there that closes the capture group name.

I've added another manual Arbitrary impl for CaptureName to only generate correct capture names.

BurntSushi pushed a commit that referenced this pull request May 18, 2023
... and add some more fuzz testing based on it.

Closes #991
BurntSushi pushed a commit that referenced this pull request May 18, 2023
... and add some more fuzz testing based on it.

Closes #991
BurntSushi pushed a commit that referenced this pull request May 21, 2023
... and add some more fuzz testing based on it.

Closes #991
BurntSushi pushed a commit that referenced this pull request May 22, 2023
... and add some more fuzz testing based on it.

Closes #991
BurntSushi pushed a commit that referenced this pull request May 22, 2023
... and add some more fuzz testing based on it.

Closes #991
@addisoncrump addisoncrump deleted the ag/regex-automata branch May 24, 2023 01:10
BurntSushi pushed a commit that referenced this pull request May 24, 2023
... and add some more fuzz testing based on it.

Closes #991
BurntSushi pushed a commit that referenced this pull request May 24, 2023
... and add some more fuzz testing based on it.

Closes #991
BurntSushi pushed a commit that referenced this pull request May 25, 2023
... and add some more fuzz testing based on it.

Closes #991
BurntSushi pushed a commit that referenced this pull request Jun 13, 2023
... and add some more fuzz testing based on it.

Closes #991
BurntSushi pushed a commit that referenced this pull request Jul 5, 2023
... and add some more fuzz testing based on it.

Closes #991
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants