Implement Unicode class arbitrary #991

addisoncrump · 2023-05-06T00:26:24Z

See late discussion in #848

This effectively copies my regex-automata work into this crate and does a bunch of rejiggering to make it work. In particular, we wire up its new test harness to the public regex crate API. In this commit, that means the regex crate API is being simultaneously tested using both the old and new test suites. This does *not* get rid of the old regex crate implementation. That will happen in a subsequent commit. This is just a staging commit to prepare for that.

If we need this again, we should just rewrite it in Rust and put it in 'regex-cli'.

All of the old tests should be covered by either porting them over explicitly, or in the TOML test suite.

We're going to drop the old benchmark suite in favor of rebar, but it's worth recording some final results. This ensures we get a fair comparison with the regex crate before and after its internals have been rewritten.

We are going to remove the old benchmark harness, but it seems like a good idea to save the old measurements. In the future, benchmarks will be maintained by rebar: https://github.com/BurntSushi/rebar

As stated in a previous commit, we'll be moving to rebar. (rebar isn't actually published at time of writing, but it's essentially ready to go.)

Closes #961

It's still not as good as it could be, but we add fuzz targets for regex-lite and DFA deserialization in regex-automata.

This feature makes all of the AST types derive the 'Arbitrary' trait, which is in turn quite useful for fuzz testing.

This makes a couple of the fuzzer targets a bit nicer by just asking for structured data instead of trying to manifest it ourselves out of a &[u8]. Closes #821

The fuzzer sometimes runs into situations where it builds regexes that can take a while to execute, such as `\B{10000}`. They fit within the default size limit, but the search times aren't great. But it's not a bug. So try to decrease the size limit a bit to try and prevent timeouts. We might consider trying to optimize cases like `\B{10000}`. A naive optimization would be to remove any redundant conditional epsilon transitions within a single epsilon closure, but that can be tricky to do a priori. The case of `\B{100000}` is probably easy to detect, but they can be arbitrarily complex. Another way to attack this would be to modify, say, the PikeVM to only compute whether a conditional epsilon transition should be followed once per haystack position. Right now, I think it is re-computing them even though it doesn't have to.

This makes uses of the new 'arbitrary' feature in 'regex-syntax' to make fuzzing much more targeted and complete. Closes #848

It turns out that the way we were dealing with quit states in the DFA was not quite right. Basically, if we entered a quit state and a match had been found, then we were returning the match instead of the error. But the match might not be the correct leftmost-first match, and so, we really shouldn't return it. Otherwise a regex like '\B.*' could match much less than it should. This was caught by a differential fuzzer developed in #848.

This adds a regression test for a bug found in the *old* regex crate that isn't present with the regex-automata rewrite. I discovered this while doing differential fuzzing. I didn't do a root cause analysis of the bug, but my guess is a literal optimization problem.

addisoncrump · 2023-05-06T00:38:03Z

I don't think this is quite complete... missing e.g. Changes_When_Casefolded

addisoncrump · 2023-05-06T01:04:17Z

I attempted to optimise this further by removing the recomputation of the full unicode table sizes every time the arbitrary was used, but this seemed to actually worsen things, so this is probably Good Enough™. There is potentially a further optimisation by "skipping" members of the unicode table (property values in particular) but I don't think this is worthwhile.

addisoncrump · 2023-05-16T01:06:40Z

Well, the force-push broke the cleanness of this a bit 😂 Would you let me know if you see any issues with the implementation I have so far? I'll port it over some point later.

BurntSushi · 2023-05-16T17:07:51Z

Yeah no worries I'll fix that. I've taken a brief break from fuzzing in favor of trying to complete the migration. (I have a lot of prose writing to do.) But I'll circle back around to this before regex 1.9.

I've also been interrupted a bit by historically bad seasonal allergies and lovely distractions such as Tears of the Kingdom. :)

BurntSushi · 2023-05-18T17:47:25Z

fuzz/Cargo.toml

-path = ".."
+arbitrary = { version = "1.3.0", features = ["derive"] }
+libfuzzer-sys = { version = "0.4.1", features = ["arbitrary-derive"] }
+regex = { path = "..", default-features = false, features = ["std"] }


@addisoncrump Out of curiosity, why did you setup the features this way? Shrinking regex down to just std removes a lot of stuff that is worth testing...

BurntSushi · 2023-05-18T18:01:58Z

So I tried this out, but I'm tempted to remove the ast_roundtrip fuzz target. It is producing waaaaay too many false positives and I just don't have the time to deal with them right now. At a higher level, the functionality that ast_roundtrip is testing is not particularly critical. It's a "nice to have" that isn't part of how the regex engines work.

I'll take this PR in for now (I cherry picked the commits), but I'm likely to remove the ast_roundtrip fuzzer. We can revisit it later.

Also probably a good idea not to submit more PRs to ag/regex-automata unfortunately. I'm still force pushing to it. I don't want to get it on to master yet.

BurntSushi · 2023-05-18T18:07:10Z

Ah okay, the false positives were a result of a flub on my part. I put env_logger::init() into each of the fuzzers, and that started panicking when I ran the fuzzer across multiple threads. Since ast_roundtrip was the first fuzzer I ran in a multi-threaded way since adding env_logger::init(), I falsely assumed it had a bunch of false positives. I'm running it now and all seems good for now.

BurntSushi · 2023-05-18T19:05:00Z

Ah yeah I see a problem. The Arbitrary impls are generating capture groups with invalid names:

Group(Group { span: Span(Position(o: 72056936891165757, l: 18446743197536222976, c: 18095463302774652928), Position(o: 47261417342, l: 9187201950444158975, c: 9187201950435737471)), kind: CaptureName { starts_with_p: true, name: CaptureName { span: Span(Position(o: 7731356377379602303, l: 15191295558507746002, c: 67961541330), Position(o: 5425479542451994624, l: 18446743297583696715, c: 5425513257945268223)), name: "KkKK>\0\0\u{b}\u{b}\u{b}\u{b}\u{b}\u{b}\u{b}?<\u{b}\u{b}\u{b}\u{b}\u{b}{\u{b}\u{b}\u{b}\u{b}\u{b}\u{b}\u{b}\u{b}\u{b}555555}}}}}}}}}eeee\u{1}\0\0\0\0\0\0,", index: 1263206624 } }, ast: Empty(Span(Position(o: 0, l: 0, c: 0), Position(o: 0, l: 0, c: 0))) })

The AST printer assumes that the AST values are valid, including the capture group name, so it dumps it out naively. But there's a > in there that closes the capture group name.

I've added another manual Arbitrary impl for CaptureName to only generate correct capture names.

... and add some more fuzz testing based on it. Closes #991

BurntSushi and others added 19 commits May 4, 2023 21:16

impl: cut over to regex-automata

8fc68b8

scripts: remove 'frequencies' script

3b1cbf9

If we need this again, we should just rewrite it in Rust and put it in 'regex-cli'.

tests: drop old tests

3e1e57d

All of the old tests should be covered by either porting them over explicitly, or in the TOML test suite.

bench: record last results with old benchmark suite

9c68579

We're going to drop the old benchmark suite in favor of rebar, but it's worth recording some final results. This ensures we get a fair comparison with the regex crate before and after its internals have been rewritten.

bench: move the old recordings to 'record' directory

d7017c1

We are going to remove the old benchmark harness, but it seems like a good idea to save the old measurements. In the future, benchmarks will be maintained by rebar: https://github.com/BurntSushi/rebar

bench: remove the old harness

a5aab16

As stated in a previous commit, we'll be moving to rebar. (rebar isn't actually published at time of writing, but it's essentially ready to go.)

api: introduce new regex-lite crate

32a6f39

Closes #961

fuzz: improve fuzz testing

629565e

It's still not as good as it could be, but we add fuzz targets for regex-lite and DFA deserialization in regex-automata.

syntax: add new 'arbitrary' crate feature

2cf3999

This feature makes all of the AST types derive the 'Arbitrary' trait, which is in turn quite useful for fuzz testing.

fuzz: use structured fuzzer input

e6c8b96

This makes a couple of the fuzzer targets a bit nicer by just asking for structured data instead of trying to manifest it ourselves out of a &[u8]. Closes #821

fuzz: add syntactic structurally aware fuzzers

ea6b15d

This makes uses of the new 'arbitrary' feature in 'regex-syntax' to make fuzzing much more targeted and complete. Closes #848

progress: API and docs refresh

144c684

maybe implement arbitrary for unicode classes

2648e3a

correction on arbitrary computation

1fdb65d

learn to feature

1a17c87

add support for changes_when_x and friends

f936446

BurntSushi force-pushed the ag/regex-automata branch from 144c684 to 8f4af0b Compare May 14, 2023 16:24

BurntSushi force-pushed the ag/regex-automata branch 5 times, most recently from edf2fe6 to b2a0c9f Compare May 18, 2023 15:39

BurntSushi reviewed May 18, 2023

View reviewed changes

BurntSushi closed this May 18, 2023

BurntSushi pushed a commit that referenced this pull request May 18, 2023

fuzz: improve Arbitrary impl for Unicode classes

b84ad0b

... and add some more fuzz testing based on it. Closes #991

BurntSushi pushed a commit that referenced this pull request May 18, 2023

fuzz: improve Arbitrary impl for Unicode classes

9c4ce43

... and add some more fuzz testing based on it. Closes #991

BurntSushi pushed a commit that referenced this pull request May 21, 2023

fuzz: improve Arbitrary impl for Unicode classes

8efae67

... and add some more fuzz testing based on it. Closes #991

BurntSushi pushed a commit that referenced this pull request May 22, 2023

fuzz: improve Arbitrary impl for Unicode classes

5fc732a

... and add some more fuzz testing based on it. Closes #991

BurntSushi pushed a commit that referenced this pull request May 22, 2023

fuzz: improve Arbitrary impl for Unicode classes

0b3a371

... and add some more fuzz testing based on it. Closes #991

addisoncrump deleted the ag/regex-automata branch May 24, 2023 01:10

BurntSushi pushed a commit that referenced this pull request May 24, 2023

fuzz: improve Arbitrary impl for Unicode classes

08d903b

... and add some more fuzz testing based on it. Closes #991

BurntSushi pushed a commit that referenced this pull request May 24, 2023

fuzz: improve Arbitrary impl for Unicode classes

5b178da

... and add some more fuzz testing based on it. Closes #991

BurntSushi pushed a commit that referenced this pull request May 25, 2023

fuzz: improve Arbitrary impl for Unicode classes

f2158f4

... and add some more fuzz testing based on it. Closes #991

BurntSushi pushed a commit that referenced this pull request Jun 13, 2023

fuzz: improve Arbitrary impl for Unicode classes

e34b914

... and add some more fuzz testing based on it. Closes #991

BurntSushi pushed a commit that referenced this pull request Jul 5, 2023

fuzz: improve Arbitrary impl for Unicode classes

cacdf37

... and add some more fuzz testing based on it. Closes #991

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement Unicode class arbitrary #991

Implement Unicode class arbitrary #991

addisoncrump commented May 6, 2023

addisoncrump commented May 6, 2023

addisoncrump commented May 6, 2023

addisoncrump commented May 16, 2023

BurntSushi commented May 16, 2023 •

edited

Loading

BurntSushi May 18, 2023

BurntSushi commented May 18, 2023

BurntSushi commented May 18, 2023

BurntSushi commented May 18, 2023

Implement Unicode class arbitrary #991

Implement Unicode class arbitrary #991

Conversation

addisoncrump commented May 6, 2023

addisoncrump commented May 6, 2023

addisoncrump commented May 6, 2023

addisoncrump commented May 16, 2023

BurntSushi commented May 16, 2023 • edited Loading

BurntSushi May 18, 2023

Choose a reason for hiding this comment

BurntSushi commented May 18, 2023

BurntSushi commented May 18, 2023

BurntSushi commented May 18, 2023

BurntSushi commented May 16, 2023 •

edited

Loading