Implement Unicode class arbitrary #991
This effectively copies my regex-automata work into this crate and does a bunch of rejiggering to make it work. In particular, we wire up its new test harness to the public regex crate API. In this commit, that means the regex crate API is being simultaneously tested using both the old and new test suites. This does *not* get rid of the old regex crate implementation. That will happen in a subsequent commit. This is just a staging commit to prepare for that.
If we need this again, we should just rewrite it in Rust and put it in 'regex-cli'.
All of the old tests should be covered by either porting them over explicitly, or in the TOML test suite.
We're going to drop the old benchmark suite in favor of rebar, but it's worth recording some final results. This ensures we get a fair comparison with the regex crate before and after its internals have been rewritten.
We are going to remove the old benchmark harness, but it seems like a good idea to save the old measurements. In the future, benchmarks will be maintained by rebar: https://github.com/BurntSushi/rebar
As stated in a previous commit, we'll be moving to rebar. (rebar isn't actually published at time of writing, but it's essentially ready to go.)
It's still not as good as it could be, but we add fuzz targets for regex-lite and DFA deserialization in regex-automata.
This feature makes all of the AST types derive the 'Arbitrary' trait, which is in turn quite useful for fuzz testing.
This makes a couple of the fuzzer targets a bit nicer by just asking for structured data instead of trying to manifest it ourselves out of a &[u8]. Closes #821
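As a rough illustration of "asking for structured data" rather than interpreting a raw `&[u8]` by hand, here is a std-only sketch. Everything in it is hypothetical: `ByteSource`, `ToyAst` and `build` are stand-ins invented for this example, not the `arbitrary` crate's API and not the `regex-syntax` AST. The point is only that every byte sequence decodes to a well-formed tree, so the fuzzer spends no time on inputs that fail to parse.

```rust
/// Hypothetical byte cursor standing in for the fuzzer-provided input.
pub struct ByteSource<'a> {
    data: &'a [u8],
    pos: usize,
}

impl<'a> ByteSource<'a> {
    pub fn new(data: &'a [u8]) -> Self {
        ByteSource { data, pos: 0 }
    }

    /// Returns the next byte, or 0 once the input is exhausted.
    fn next_byte(&mut self) -> u8 {
        let b = self.data.get(self.pos).copied().unwrap_or(0);
        self.pos = self.pos.saturating_add(1);
        b
    }
}

/// Toy AST (an assumption for this sketch), far smaller than a real regex AST.
#[derive(Debug, PartialEq)]
pub enum ToyAst {
    Literal(char),
    Concat(Box<ToyAst>, Box<ToyAst>),
    Star(Box<ToyAst>),
}

/// Decodes bytes into a well-formed `ToyAst`; `depth` bounds the recursion.
pub fn build(src: &mut ByteSource<'_>, depth: u32) -> ToyAst {
    if depth == 0 {
        // At the depth limit, always emit a leaf in 'a'..='z'.
        return ToyAst::Literal((b'a' + src.next_byte() % 26) as char);
    }
    // One byte selects the node kind; later bytes fill in the children.
    match src.next_byte() % 3 {
        0 => ToyAst::Literal((b'a' + src.next_byte() % 26) as char),
        1 => ToyAst::Concat(
            Box::new(build(src, depth - 1)),
            Box::new(build(src, depth - 1)),
        ),
        _ => ToyAst::Star(Box::new(build(src, depth - 1))),
    }
}
```

A fuzz target built this way would call something like `build(&mut ByteSource::new(data), 8)` and feed the resulting tree to the code under test.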
The fuzzer sometimes runs into situations where it builds regexes that can take a while to execute, such as `\B{10000}`. They fit within the default size limit, but the search times aren't great. But it's not a bug. So try to decrease the size limit a bit to try and prevent timeouts. We might consider trying to optimize cases like `\B{10000}`. A naive optimization would be to remove any redundant conditional epsilon transitions within a single epsilon closure, but that can be tricky to do a priori. The case of `\B{100000}` is probably easy to detect, but they can be arbitrarily complex. Another way to attack this would be to modify, say, the PikeVM to only compute whether a conditional epsilon transition should be followed once per haystack position. Right now, I think it is re-computing them even though it doesn't have to.
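The "compute it once per haystack position" idea can be sketched in isolation. This is not the PikeVM's code; `LookCache` and `follow_chain` are hypothetical names, and the assertion shown is an ASCII-only `\B`. The sketch assumes `at <= haystack.len()`.

```rust
/// Returns true if `b` is an ASCII word byte ([0-9A-Za-z_]).
fn is_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// Evaluates an ASCII-only `\B` (not a word boundary) at offset `at`.
/// Assumes `at <= haystack.len()`.
pub fn not_word_boundary(haystack: &[u8], at: usize) -> bool {
    let before = at > 0 && is_word_byte(haystack[at - 1]);
    let after = at < haystack.len() && is_word_byte(haystack[at]);
    before == after
}

/// Hypothetical cache that remembers the assertion result for the most
/// recently queried position, so repeated queries there cost one evaluation.
pub struct LookCache {
    at: Option<usize>,
    value: bool,
    /// Number of times the assertion was actually evaluated.
    pub evaluations: usize,
}

impl LookCache {
    pub fn new() -> Self {
        LookCache { at: None, value: false, evaluations: 0 }
    }

    pub fn get(&mut self, haystack: &[u8], at: usize) -> bool {
        if self.at != Some(at) {
            self.value = not_word_boundary(haystack, at);
            self.at = Some(at);
            self.evaluations += 1;
        }
        self.value
    }
}

/// Simulates following `repeats` chained `\B` transitions at one position,
/// as an epsilon closure for something like `\B{10000}` would.
pub fn follow_chain(
    cache: &mut LookCache,
    haystack: &[u8],
    at: usize,
    repeats: usize,
) -> bool {
    (0..repeats).all(|_| cache.get(haystack, at))
}
```

With the cache, a chain of 10000 identical conditional transitions at one offset evaluates the assertion once instead of 10000 times. A real implementation would also have to key the cache on the kind of assertion, which this sketch ignores.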
This makes use of the new 'arbitrary' feature in 'regex-syntax' to make fuzzing much more targeted and complete. Closes #848
It turns out that the way we were dealing with quit states in the DFA was not quite right. Basically, if we entered a quit state and a match had been found, then we were returning the match instead of the error. But the match might not be the correct leftmost-first match, and so, we really shouldn't return it. Otherwise a regex like '\B.*' could match much less than it should. This was caught by a differential fuzzer developed in #848.
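A toy model of the corrected behavior, under heavy assumptions: the DFA is reduced to a precomputed sequence of state kinds, one per haystack byte, and `Step`, `SearchError` and `search` are names invented for this sketch rather than regex-automata types. What matters is the `Quit` arm: it reports an error even when a match has already been recorded, because a longer leftmost-first match might lie beyond the point where the DFA gave up.

```rust
/// Kind of state the toy DFA enters after consuming one haystack byte.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Step {
    Continue,
    Match,
    Dead,
    Quit,
}

/// Error returned when the search cannot be completed by the DFA.
#[derive(Debug, PartialEq)]
pub enum SearchError {
    Quit { offset: usize },
}

/// Returns the end offset of the last match seen before a dead state,
/// or an error if a quit state is entered at any point.
pub fn search(steps: &[Step]) -> Result<Option<usize>, SearchError> {
    let mut last_match = None;
    for (i, step) in steps.iter().enumerate() {
        match step {
            Step::Continue => {}
            // Remember the match but keep going: a longer one may follow.
            Step::Match => last_match = Some(i + 1),
            // No further match is possible, so the recorded one is final.
            Step::Dead => return Ok(last_match),
            // The fix: do NOT return `last_match` here. The DFA gave up,
            // so the recorded match may be shorter than the correct one.
            Step::Quit => return Err(SearchError::Quit { offset: i }),
        }
    }
    Ok(last_match)
}
```

A caller that receives the error would typically retry the search with an engine that can handle the input, rather than trust the partial result.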
This adds a regression test for a bug found in the *old* regex crate that isn't present with the regex-automata rewrite. I discovered this while doing differential fuzzing. I didn't do a root cause analysis of the bug, but my guess is a literal optimization problem.
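For readers unfamiliar with the technique, differential fuzzing runs the same input through two implementations and flags any disagreement. The sketch below uses two substring searches as stand-ins for the old and new regex engines; `find_naive`, `find_windows` and `differential` are hypothetical names for this example only.

```rust
/// Reference implementation: naive leftmost substring search.
pub fn find_naive(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        return Some(0);
    }
    if needle.len() > haystack.len() {
        return None;
    }
    (0..=haystack.len() - needle.len())
        .find(|&i| &haystack[i..i + needle.len()] == needle)
}

/// Candidate implementation: the same search written with `windows`.
pub fn find_windows(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    // `windows(0)` panics, so the empty needle is handled up front.
    if needle.is_empty() {
        return Some(0);
    }
    haystack.windows(needle.len()).position(|w| w == needle)
}

/// Runs both implementations and reports any disagreement.
pub fn differential(haystack: &[u8], needle: &[u8]) -> Result<(), String> {
    let a = find_naive(haystack, needle);
    let b = find_windows(haystack, needle);
    if a == b {
        Ok(())
    } else {
        Err(format!("mismatch: naive={:?} windows={:?}", a, b))
    }
}
```

In a fuzz target, the `Err` case would become a panic so the fuzzer records the input, which can then be reduced and checked in as a regression test.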
I don't think this is quite complete... missing e.g.
I attempted to optimise this further by removing the recomputation of the full unicode table sizes every time the arbitrary was used, but this seemed to actually worsen things, so this is probably Good Enough™. There is potentially a further optimisation by "skipping" members of the unicode table (property values in particular) but I don't think this is worthwhile.
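To make the trade-off concrete, here is one way such a selection could look. This is a guess at the general shape, not the PR's code: `pick` and the two tiny tables are hypothetical, and the real Unicode tables are far larger. The function recomputes the combined table size on every call, which is the step the comment above describes trying to avoid.

```rust
/// Toy stand-ins for the real Unicode property tables (assumed for this sketch).
pub const GENERAL_CATEGORIES: &[&str] = &["Lu", "Ll", "Nd"];
pub const SCRIPTS: &[&str] = &["Greek", "Latin"];

/// Maps an arbitrary integer onto one entry across all tables, treating
/// them as a single concatenated list. Returns `None` if every table is empty.
pub fn pick<'a>(tables: &[&[&'a str]], choice: usize) -> Option<&'a str> {
    // Recomputed on each call; cheap next to building and running a regex.
    let total: usize = tables.iter().map(|t| t.len()).sum();
    if total == 0 {
        return None;
    }
    let mut idx = choice % total;
    for table in tables {
        if idx < table.len() {
            return Some(table[idx]);
        }
        // Skip past this table and keep looking in the next one.
        idx -= table.len();
    }
    None
}
```

For example, `pick(&[GENERAL_CATEGORIES, SCRIPTS], n)` yields a class name for any `n`, which a generator could then wrap as `\p{...}`.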
Force-pushed from 144c684 to 8f4af0b.
Well, the force-push broke the cleanness of this a bit 😂 Would you let me know if you see any issues with the implementation I have so far? I'll port it over some point later.
Yeah no worries I'll fix that. I've taken a brief break from fuzzing in favor of trying to complete the migration. (I have a lot of prose writing to do.) But I'll circle back around to this before I've also been interrupted a bit by historically bad seasonal allergies and lovely distractions such as Tears of the Kingdom. :)
Force-pushed from edf2fe6 to b2a0c9f.
path = ".."
arbitrary = { version = "1.3.0", features = ["derive"] }
libfuzzer-sys = { version = "0.4.1", features = ["arbitrary-derive"] }
regex = { path = "..", default-features = false, features = ["std"] }
@addisoncrump Out of curiosity, why did you set up the features this way? Shrinking `regex` down to just `std` removes a lot of stuff that is worth testing...
So I tried this out, but I'm tempted to remove the I'll take this PR in for now (I cherry picked the commits), but I'm likely to remove the Also probably a good idea not to submit more PRs to
Ah okay, the false positives were a result of a flub on my part. I put
Ah yeah I see a problem. The
The AST printer assumes that the AST values are valid, including the capture group name, so it dumps it out naively. But there's a I've added another manual
... and add some more fuzz testing based on it. Closes #991
See late discussion in #848