Is the regex crate a bottleneck in your program? If so, can you share the details? #960

BurntSushi · 2023-02-23T19:14:12Z

BurntSushi
Feb 23, 2023
Maintainer

I am working on a regex benchmark (to be made public probably some time this year), and I'm looking to add regexes to it that are used in the real world. The benchmark will of course include regexes that I've conjured to test certain properties of regex implementations, but I would also like to balance that with real world regexes. I am especially interested in regexes that are used in performance sensitive aspects of your application. It doesn't necessarily have to be the number 1 bottleneck, but ideally it's impacting your bottom line performance in some way.

If you're willing to share, I would love to know both the regex and an example haystack that is searched. I would also like to know which regex APIs you use, for example, is_match, find or captures. Finally, if you could briefly explain what the regex is doing conceptually that would be awesome.

Note: many folks are responding which is great. But many are describing the bottleneck they hit instead of showing it. What I'm really asking for here is the actual regex, a (possibly sanitized) haystack, and the API you're using. An ideal submission is a short program that you agree recapitulates the essential performance problem you're facing. The haystack doesn't need to be big. 30MB should be more than enough for example.

osiewicz · 2023-02-23T21:25:57Z

osiewicz
Feb 23, 2023

Is pattern disclosure via an email ([email protected]) an option? I am aware that the regexes would end up in public resources in the end, but I'd be keen to discuss their use case outside of a public eye.

For my use case I care mostly about the performance of is_match of RegexSet. I currently use hyperscan, but I'd rather use either regex or regex_automata as it often comes close (or even beats) hyperscan - and regex_automata supports no_std, which is pretty handy (but not crucial). I believe HS has tons of matchers for very specific automatons which might give it an edge.

2 replies

BurntSushi Feb 23, 2023
Maintainer Author

Email is fine, yes. I would want to find a way to talk about what the regex is used for though. I think that's probably a requirement. As long it there's some (possibly vague) way of describing the problem it's trying to solve, and I can name the software/company that it's used in, then I think I'd be okay with including something like that.

RegexSet benchmarks are good. I'll probably only wind up with a few of them in the benchmark overall though, because most regex engines don't support them.

osiewicz Mar 7, 2023

Apparently we were comparing apples to oranges. According to @BurntSushi Hyperscan's defaults are quite different from this crate's defaults; namely, \w, \s and \d are ASCII-only by default in Hyperscan and Unicode-aware in Rust regex.

So on the topic of Hyperscan's \w and \s not supporting Unicode, yes,
indeed, that is the case. It matches PCRE2 behavior. And like PCRE2,
Hyperscan has a UCP option[1] that, when enabled, makes \w, \s and \d
Unicode aware. Similarly, in the regex crate, you can disable Unicode
mode to get the ASCII aware versions. For example, (?-u:\w) is
equivalent to [[:word:]] which is equivalent to [0-9A-Za-z_]. So
the difference between the regex crate and Hyperscan (and PCRE2) is
really just the defaults.
[1]: https://docs.rs/hyperscan/latest/hyperscan/struct.PatternFlags.html#associatedconstant.UCP

Furthermore, when we've benchmarked with ag/regex-automata branch (that switched aho-corasick to 1.0 and has some new literal optimizations), we saw approximately 2x improvement in our latency (from 13 microseconds to 6.5 microseconds) compared to 1.7.1. I believe we can close the gap by disabling Unicode properties to actually compare apples to apples.
We develop a complex event processing engine that spends a lot of time in regex implementation while processing mostly short haystacks (of less than 1kB).

Shnatsel · 2023-02-24T12:37:12Z

Shnatsel
Feb 24, 2023

cargo-audit currently uses the regex crate to extract panic messages from binaries in order to (partially) recover the dependency list and report vulnerabilities. The regex and the rest of the code can be found here. The sample haystacks are your .cargo/bin/ directory.

This comes with a big caveat that this is not the regex I wanted to write, but the regex that runs the fastest today when using the regex crate.

I ran less into the current implementation being slow (they're fast enough to make memory bandwidth the bottleneck) and more into the crate not exposing the APIs I need to avoid multiple scans over the input. I need to account for both / and \ path separators, but I could not write a fast regex that would implement that, so I use two regexes - one for Windows paths and the other for Unix. And since there's no captures_iter() on RegexSet, I have to run the full scan twice and do two full reads from memory.

IIRC I tried using [/\\\] everywhere for the paths which was very slow, and doing windows_regex|unix_regex was as slow as two passes, so I decided to keep things simple and just do two passes, and short-circuit if the first one matches. If either of those were faster today, I'd use them instead of two different regexes.

The code and detailed notes about performance can be found here.

8 replies

BurntSushi Feb 25, 2023
Maintainer Author

I plopped this into my benchmark harness in the following way:

I used 4 regexes. Your Unix regex, your Windows regex, then the Unix regex but with / replaced with [/\\], and finally an alternation of the Unix and Windows regexes.
Two distinct haystacks: the ripgrep 13 Linux/musl binary and the ripgrep 13 Windows/msvc binary. I only used the Windows binary for the benchmark of the Windows regex.

Here are the results, with regexold being the status quo and rust/regex/meta being the new engine:

$ rebar cmp tmp/results.csv -e meta -e regexold
benchmark                                  rust/regex/meta    rust/regexold
---------                                  ---------------    -------------
wild/rustsec-cargo-audit/both-alternate    26.3 GB/s (1.00x)  4.8 GB/s (5.47x)
wild/rustsec-cargo-audit/both-slashes      27.1 GB/s (1.00x)  5.1 GB/s (5.35x)
wild/rustsec-cargo-audit/original-unix     24.9 GB/s (1.00x)  9.5 GB/s (2.62x)
wild/rustsec-cargo-audit/original-windows  23.9 GB/s (1.00x)  8.7 GB/s (2.74x)

So you'll get a pretty big boost here and you'll be able to simplify your code. Although, the regexes that you said you tried and were slow don't appear to be that slow. It looks like they're about twice as slow as just the Unix regex though. I suppose if you're running this over a huge corpus of binaries, that could turn, say, a 2 minute search into a 1 minute search, which probably does indeed feel like a huge win. But yeah, still probably fast enough to at least exhaust I/O bandwidth, but nothing here touches memory bandwidth (at least on my system).

Unfortunately, your regex just happens to be susceptible to heuristics. Your Unix regex has a nice long prefix. But when you use [/\\] for your slashes instead (or make a big alternation), then you wind up with multiple literals as your prefix. The status quo doesn't notice that cargo is a common prefix, but the new meta regex engine does. So the status quo is actually doing a multiple literal search instead of a single literal search for cargo. Otherwise, you do also appear to be paying for resolving capture groups. At least in the new engine, I can get it up to 35 GB/s if I only ask for the bounds of the match.

When the new meta regex engine lands, this could potentially go even faster by rewriting the regex to be "one-pass," but I unfortunately don't see an easy way to do that here.

Anyway, thank you, this will eventually be part of the benchmark suite. :-)

BurntSushi Feb 25, 2023
Maintainer Author

The other reason for the speed up here is that the new meta engine uses the bounded backtracker to resolve capture groups, while the status quo uses the slower PikeVM. I'm not sure exactly why that is, but it could have to do with the fact that the new NFA only has 46 states where as the old one has 66, due to various optimizations and changes in state representation.

And since this regex is almost one-pass, the backtracker likely zips through this pretty quickly.

Dietr1ch Feb 25, 2023

The status quo doesn't notice that cargo is a common prefix, but the new meta regex engine does. So the status quo is actually doing a multiple literal search instead of a single literal search for cargo

(Disclaimer: I'm way out of my depth here with no experience on real-world regex engines)

Have you thought about exposing an API to build the ~automaton graph directly? In cases where the regex to automata generation is not clever enough it might be useful to drop this outer layer.

A while back I needed to match paths in an RDF graph, and ditching ~regexes to work directly on the automaton gave me a lot of control to fine-tune the queries and saved me from writing a regex engine and optimizing it.

I'm thinking that here, it could be neat to use the regex to get the automaton object, visualize it, and get the code to generate it using the automaton-oriented API so you can try to refactor it.

BurntSushi Feb 26, 2023
Maintainer Author

Ah, so you want to get rid of those pesky heuristics and become master of your own destiny! Well, you'll be able to do that using regex-automata::meta::Regex, but you'll deserve what you get. :-)

$ rebar cmp tmp/results.csv
benchmark                                  rust/regex/meta      rust/regexold
---------                                  ---------------      -------------
wild/rustsec-cargo-audit/both-alternate    811.5 MB/s (6.07x)   4.8 GB/s (1.00x)
wild/rustsec-cargo-audit/both-slashes      812.8 MB/s (6.44x)   5.1 GB/s (1.00x)
wild/rustsec-cargo-audit/original-unix     812.8 MB/s (11.38x)  9.0 GB/s (1.00x)
wild/rustsec-cargo-audit/original-windows  796.7 MB/s (11.28x)  8.8 GB/s (1.00x)

Notice how without the heuristics and just using the plain old automaton (a lazy DFA in this case), you'll get quite a bit slower than the status quo. You can't really hack a DFA to make it go faster. While there are certainly cases where the lazy DFA falls over due to the huge size of a regex (or pathological cases where NFA->DFA powerset construction is exponential), generally speaking, it tops out at the speeds shown above. You can't really change it to go any faster. The NFA is likely susceptible to some level of hacking, but if the NFA is doing the bulk of the work, then you're probably an order of magnitude slower than even what rust/regex/meta is above.

In order for your regex to be "fast," it really needs the regex engine (or whatever) to do the prefilter search. You could do that yourself and then invoke the automaton for each candidate you find, but then you're (probably, depending on your regex) exposing yourself to worst case quadratic behavior in the size of the haystack. It's quite difficult to compose these types of things. This is why each internal regex engine in regex is responsible for running its own prefilter (generally speaking) instead of it being the responsibility of same higher level layer.

With that said, the narrow answer to your question is that regex-automata will expose the regex engines to you directly, whether via a lazy DFA or a fully compiled DFA. You could then give it your own prefilter, or even drop down into walking the automaton yourself. If you're walking the automaton yourself, then you can run your own prefilter without being susceptible to worst case quadratic behavior by only running it when your automaton is in a start state.

Interpreting your question broadly, you probably want access to the heuristics themselves. I don't know how feasible that is. Literal searching is a dark art and the number of knobs is pretty enormous. It's like asking rustc to give you access to LLVM's register allocator. (Perhaps not the same magnitude of complexity, but you get the idea.)

orf Feb 26, 2023

I don’t really have much to add technically to this specific discussion, but I would really like to extend my gratitude @BurntSushi for your continual output of interesting/enlightening content, via comments like this and your incredible libraries.

Even though this area isn’t my speciality, I find your writings about the problem space and optimisations very interesting. Thank you.

eminence · 2023-02-24T15:54:52Z

eminence
Feb 24, 2023

I use regex to extract out about 35 fields from an haproxy logfile. regex is a "bottleneck" but only because my code does very little processing after regex (so there's not really anything to fault regex for). I want regex to be faster, but only in the context of "I want literally everything to be faster" 😁

The regex is here and if you're interested I can provide some arbitrarily large haystacks

5 replies

BurntSushi Feb 24, 2023
Maintainer Author

YES! Love it. This looks great. If you could provide a haystack that you feel is representative of the real world, that would be amazing.

It doesn't have to be huge. These are micro-benchmarks, so somewhere in the 5MB-30MB range is just fine. (If you aren't sure, give me something big and I can trim it down.)

Also, how do you use the regex? As in, which API? It looks like probably captures_iter?

eminence Feb 24, 2023

Oh yes, Sorry I forgot to say: I'm parsing pretty big files line by line, and then using captures on each line. Something like this:

let reader = BufReader::new(my_file)?;
for line in reader.lines() {
    let line = line?;
    if let Some(caps) = RE_V4.captures(line).or_else(|| RE_V2.captures(line)) {
        // do a bunch of stuff like:
        let url = caps.name("url").unwrap();
    }
}

About 80% of the lines in my file match one of these two regex, 20% don't match either.

Here's an example about what one matching line looks like:

Feb 20 00:00:51 ip-11-9-101-231.us-east-4.compute.internal security-proxy[21324]: v4 [20/Feb/2023:00:00:51.604] HAPEE-SECURITY src=127.0.0.1 threshold=4000,trafic=1,trafic_agg=0,ddos_status=ok,ip_denylisted=no,cookie=no,x-forwarded-tenant=A11223344556677,cpu_ms_tot=1,cpu_calls=186,timer=96/82 fe_main~ be_app/app1 0/1/1/7/9 401 2620 - - ---- 21/16/13/13/0 0/0 {A11223344556677|} "GET /hello_world HTTP/1.1" Response=- Response=-

And a non-matching line might look like:

Feb 20 00:00:54 ip-11-9-101-231.us-east-4.compute.internal security-proxy[21324]: @modsecurity: {"identifier":"waf_blocking","frontend":"fe_main","audit_log":{"transaction":{"client_ip":"127.0.0.1","time_stamp":"Mon Feb 20 00:00:54 2023","server_id":"b1946ac92492d2347c6235b4d2611184","client_port":46230,"host_ip":"11.9.191.231","host_port":443,"unique_id":"167685125470.215451","request":{"method":"POST","http_version":1.1,"uri":"/login","headers":{"content-type":"application/x-www-form-urlencoded","cache-control":"no-cache","accept":"text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2","pragma":"no-cache","user-agent":"Java/17.0.5","host":"backend1.server.example.com","content-length":"493"}},"response":{"http_code":400,"headers":{"HTTP/1.1":"400 ","access-control-max-age":"600","x-frame-options":"SAMEORIGIN","date":"Mon, 20 Feb 2023 00:00:54 GMT","access-control-allow-methods":"GET, POST, OPTIONS, HEAD, PUT, DELETE, PATCH","server":"Microsoft-IIS/7.0","access-control-allow-credentials":"true","strict-transport-security":"max-age=63072000; includeSubdomains; preload","access-control-allow-headers":"accept,x-requested-method,origin,x-requested-with,x-request,cache-control,content-type,last-modified,x-utc-offset,SecurityContext,X-AZ-CSRFTOKEN,keep-alive,authorization,X-CSRF-TOKEN,x-xsrf-token,AZ-API-Version,AZ-Client-Step-Name,AZ-Request-ID,ENO_CSRF_TOKEN,SecurityToken,AZ-Change-Authoring-Context, AZ-API-Version,AZ-Client-Step-Name,AZ-Request-ID,ENO_CSRF_TOKEN,SecurityToken,AZ-Change-Authoring-Context","set-cookie":"SERVERID=ServerAPI_0_9080; path=/; HttpOnly; SameSite=None; Secure","x-xss-protection":"1;mode=block","vary":"Origin","access-control-expose-headers":"AZ-Request-ID,X-AZ-CSRFTOKEN,_csrf,_csrf_header","content-length":"0","x-content-type-options":"nosniff"}},"producer":{"modsecurity":"ModSecurity v9.0.9 (Linux)","connector":"HAProxy ModSecurity version 9.9 2029/09/09 compiled with HAProxy version 9.9.9-1.0.0-999.999 '2029/09/09'\n","secrules_engine":"Enabled","components":["OWASP_CRS/9.9.9\""]},"messages":[]}}}

I have gigabytes worth of logs like this that I can sanitize and send to you, but maybe you can derive your own examplars from these two examples. Let me know

BurntSushi Feb 24, 2023
Maintainer Author

It would be great to just get say, a 30MB sanitized sample from a real log file. Basically, if it's available, I'd always rather real data than to make my own up.

eminence Apr 5, 2023

@BurntSushi I'm embarrassed to say this fell off my radar and I totally failed to follow through on this. Are you still interested in getting a real-life haystack example for my case? If so, I can get this ready in a few days

BurntSushi Apr 5, 2023
Maintainer Author

Yeah absolutely!

devsnek · 2023-02-24T19:10:54Z

devsnek
Feb 24, 2023

We have run into issues with the size of the compiled regexes. For example \w generates an incredible amount of bytecode due to how many unicode characters match it. Benchmarking the size of compiled regexes, in addition to their matching performance, seems like it could be worthwhile.

1 reply

BurntSushi Feb 24, 2023
Maintainer Author

Yes, the size is a problem, but it's not something that can realistically improve much AFAIK. It's also a very different (although connected) problem from the one I'm asking about here. You might consider that you're specifically asking for a very huge matching machine when you use \w. It is literally matching all characters considered part of words across a number of different written languages. You could turn off Unicode mode to just get the ASCII definition, or you could intersect it with a script if you know that's what you're searching. For example, for Greek text, [\w&&\p{Greek}] will be significantly smaller than just \w.

Now I do have some improvements landing that should help. Below, the first command shows the number of NFA states in the status quo. The second command is a new debug tool I'm working on that shows various summary data about the new NFA, including an order of magnitude decrease in the number of states:

$ regex-debug compile '\w' --dfa | wc -l
3574
$ regex-cli debug nfa thompson '\w' -q
      parse time:  8.143µs
  translate time:  8.865µs
compile nfa time:  179.295µs
      nfa memory:  17404
      nfa states:  312
   pattern count:  1
   capture count:  1

Basically, the new NFA compiles minimal UTF-8 automata. (Which you can confirm by building a DFA, minimizing it and comparing its states to the ones in the NFA.) Unfortunately, the same gains don't necessarily apply to the reverse case, and most regexes compile both a forward and reverse NFA:

$ regex-debug compile '\w' --dfa-reverse | wc -l
2234
$ regex-cli debug nfa thompson '\w' -qr --no-captures
      parse time:  7.945µs
  translate time:  6.694µs
compile nfa time:  96.822µs
      nfa memory:  37332
      nfa states:  1387
   pattern count:  1
   capture count:  0

It is possible to apply the shrinking algorithm in reverse, but in the reverse case, it causes NFA compilation time to balloon:

$ regex-cli debug nfa thompson '\w' -qr --no-captures --shrink
      parse time:  8.181µs
  translate time:  8.559µs
compile nfa time:  10.213498ms
      nfa memory:  119220
      nfa states:  489
   pattern count:  1
   capture count:  0

The other benefit from the status quo is that the new internals will compile two NFAs, where as today, three NFAs are compiled.

Overall these things lead to a... modest reduction in size. Consider this program for the status quo:

use regex::Regex;

fn main() {
    let mut res = vec![];
    for _ in 0..10_000 {
        res.push(Regex::new(r"\w").unwrap());
    }

    let haystack = std::env::args().nth(1).unwrap();
    let mut count = 0;
    for re in res.iter() {
        if re.is_match(&haystack) {
            count += 1;
        }
    }
    println!("{}", count);
}

and this one for the new internal regex engine:

use regex_automata::meta::Regex;

fn main() {
    let mut res = vec![];
    for _ in 0..10_000 {
        let re = Regex::new(r"\w").unwrap();
        let cache = re.create_cache();
        res.push((re, cache));
    }

    let haystack = std::env::args().nth(1).unwrap();
    let mut count = 0;
    for (re, cache) in res.iter_mut() {
        if re.is_match(cache, &haystack) {
            count += 1;
        }
    }
    println!("{}", count);
}

Compiling both in release mode and running them:

$ time ./target/release/regmem quux
10000

real    2.837
user    2.062
sys     0.773
maxmem  3917 MB
faults  0
$ time ./target/release/regmem-new quux
10000

real    2.628
user    1.934
sys     0.692
maxmem  3787 MB
faults  0

So... somewhat faster and a little less memory usage.

I think a better approach here would be for folks to file issues for cases where the memory usage is egregious and causing you actual problems. But not in this thread. I'm specifically asking about search performance here.

orf · 2023-02-25T02:05:15Z

orf
Feb 25, 2023

Please don’t judge me for this monstrosity (😅), but here is one I use to find potential AWS keys in code:

(('|\")((?:ASIA|AKIA|AROA|AIDA)([A-Z0-7]{16}))('|\").*?(\n^.*?){0,4}(('|\")[a-zA-Z0-9+/]{40}('|\"))+|('|\")[a-zA-Z0-9+/]{40}('|\").*?(\n^.*?){0,3}('|\")((?:ASIA|AKIA|AROA|AIDA)([A-Z0-7]{16}))('|\"))+

the comment describes the thought process:

// This is a bit ridiculous, but it searches for the AWS access key pattern above combined with
// the secret key regex ([a-zA-Z0-9+/]{40}). This regex is too general, so we need to pair it with
// an access key match that is between 0 and 4 lines of the secret key match.
// It only looks for keys surrounded by quotes, else the false positive rate is too large.
// It also supports secret keys being defined before the access key.

I would use captures if I wasn’t outsourcing this to ripgrep. I ran this over every bit of code published to pypi and rubygems

2 replies

BurntSushi Feb 25, 2023
Maintainer Author

This is fantastic. Thank you!

ClementNerma Mar 3, 2023

I think you may be better off using Pomsky which provides a simpler and cleaner language which are then compiled to regex sources :)

OpatrilPeter · 2023-02-25T03:56:56Z

OpatrilPeter
Feb 25, 2023

Hello,

I believe I have some insights that may be helpful for you, since I've spend a lot of time tuning high-performance regex parsing application some months ago.

Brief background & motivation:

In our company, we produce large volume of unstructured logs in proprietary format and I often had to do adhoc analysis on files that are around 4GB.
I had good experiences with adhoc JSON transformations in jq, so I wrote regex-based parsing of our format as jq library, creating deeply nested rich objects per line. While the body of each message was arbitrary text, many kinds of messages like webserver per-request reports were common enough and useful to parse further, which could be done by speculatively parsing parts of the regex by a subregex if possible.
Overall, we could get from

2020/06/11 19:03:47.773257 I3: [34571]: server-fastrpc 'search("www.google.com", {abTestId: 0, acceptLanguage: "cs-CZ,cs;q=0.9,en-US...", bInAbTest: false, bestPlaCtrAt1: 0.00949335, bestTextCtrAt1: 0.288573, count: 3, deviceType: 1, enableRvs: false, gsids: -hidden-, includeEta: true, latitude: 49.20, longitude: 17.58, offset: 0, plaCount: 4, referer: "https://search.sezna...", sitelinkCount: 6, topCount: 4, userAgent: "Mozilla X11 Linux x8...", userId: 0, webId: 44567})': status=200, time=10.519ms, peer=127.0.0.1 {fastrpcinterface.cc:postProcess():100}

something like

{
  "timestamp": "2020/06/11 19:03:47.773257",
  "level": "I3",
  "body": {
    "type": "server_call",
    "message": {
      "interface": "server-fastrpc",
      "call_info": {
        "status": "200",
        "time": "10.519ms",
        "peer": "127.0.0.1"
      },
      "method": "search",
      "args": [
        "www.google.com",
        {
          "abTestId": 0,
          "acceptLanguage": "cs-CZ,cs;q=0.9,en-US...",
          "bInAbTest": false,
          "bestPlaCtrAt1": 0.00949335,
          "bestTextCtrAt1": 0.288573,
          "count": 3,
          "deviceType": 1,
          "enableRvs": false,
          "gsids": "-hidden-",
          "includeEta": true,
          "latitude": 49.20,
          "longitude": 17.58,
          "offset": 0,
          "plaCount": 4,
          "referer": "https://search.sezna...",
          "sitelinkCount": 6,
          "topCount": 4,
          "userAgent": "Mozilla X11 Linux x8...",
          "userId": 0,
          "webId": 44596
        }
      ]
    }
  },
  "instance": {
    "pid": "34571",
    "tid": null,
    "log_writer_tid": null,
    "instance_extra": ""
  },
  "location": {
    "file": "fastrpcinterface.cc",
    "function": "postProcess",
    "line": "100"
  }
}

which lends well to further analysis.
The only catch was that it wasn't performant enough for processing more that a few lines, gigabytes would take impractical amount of time (110MB file took 3.5 minutes).

As Ripgrep is an excellent tool that can process these multi-GB files in low seconds and regex crate also has great reputation overall, I've got inspired to try rewriting and generalizing this idea as Rust CLI app transforming stream of messages with arbitrary log format specified mainly by hierarchy of regexes into stream of JSON messages.

I've managed to do it, but the result wasn't as fast as I've hoped, not enough to casually transform whole files near-interactively, so I had to stick with per-case prefiltering just like before, ultimately making the tool not worth finishing, for now at least.

As the regexes are user facing and often adhoc hand tuned for individual servers for handling their own unique message kinds, the intention was to keep them highlevel and not overoptimized - on the other hand, I've put lot of effort into optimizing the program itself, likely even overengineed that a fair bit, so the resulting performance is mostly (75% according to profiling) spent in the regexes themselves. The I/O cost was actually mostly negligible.

Overall I've managed to crunch reference 2.2GB file in around 46 seconds from cold start, which isn't terrible, but in real world I had to wait around 3 minutes for some bigger ones and with, say, 8 logfiles, this wasn't what I hoped for.

Takeaways

I've played with mmap and madvise and at least on machines I've been using (Debian Stretch & Buster, x86_64) there wasn't any notable difference from conventional Readers - I've read your post somewhere with similar findings in Ripgrep.

I've had fancy-regex as optional conditional compilation feature - they are built on top of regex crate, so one would expect that if you don't use advanced features like backtracking, the regexes may be as performant as raw regex crate - unfortunately, it doesn't seem to be the case and there was small, but noticeable perf hit.

Profile guided optimization consistently helped gain around 5-10% speed increase.

Original version was written on top of unicode strings. As the regexes I've used needed only ascii matching, I've tried moving to bytewise (regex::bytes) regexes, but that instead lead to performance hit. Even after adding (?-u) equivalent and fixing the obvious footgun that bytewise regexes still work in unicode mode by default, the performance was still slightly worse overall, especially considering that the unicode verson spent 18% of the time in additional std::str::from_utf8 that wasn't in the bytewise version.

The most notable optimization (from around 4.5 minutes to sub 1 minute) I got from parallelization, having worker threads working on lines independently with number of workers matching number of logical cores (so around 12). I've read your post about the internal scratch spaces in Regex and tradeoffs between API simplicity and optimality - and can confirm that difference between sharing the regexes between threads and cloning them was negligible.

Most of the processing time was spent in handling the primary regex, so we'll only consider that one.
It could be described by this pseudoregex grammar:

# Dbglog grammar (in recursive regex format):
#     logline:
#         `^(?P<timestamp>) (?P<level>): (?:\[(?P<instance>)\](?P<instance_extra>): )?(?P<body>) (?P<location>)$`
#     timestamp:
#         `[\d/]+ \d\d:\d\d:\d\d(?:.\d+)?`
#     level:
#         `[DIWEF][1234]`
#     instance:
#         one of
#             // Builtin format
#             `(?P<pid>)(?::(?P<log_writer_tid>))?(?::\((?P<tid>)\))?`
#             // Overridden format
#             `[^\]]*`
#     instance_extra:
#         one of
#             `\[[^\]]*\]`
#             `[^:]*`
#     pid:
#         `\d+`
#     log_writer_tid:
#         `\d+`
#     tid:
#         `\d+`
#     location:
#         `{(?P<file>):(?P<function>):(?P<line>)}`
#     file:
#         `[^:]+`
#     function:
#         `[^:]+`
#     line:
#         `[^:]+`
#     body:
#         `.*`

In the rust regex syntax the final version I was using looked like this

(?x)
^
# (?P<timestamp>\d\d\d\d/\d\d/\d\d\ \d\d:\d\d:\d\d(?:.\d+)?)
(?P<timestamp>[^\ ]+\ [^\ ]+)

# We could parse log sublevels as well as severity, but severity is enough
# \ (?P<level>[DIWEF][1234]):[\ ]
\ (?P<level>[DIWEF])[1234]:[\ ]
(?P<header>
    (?:
        (?:
            \[ [^\]]*? \] | \( [^\)]*? \)
        ):[\ ]
    )*
)

(?P<body>.*?)

[\ ]\{(?P<location>[^\}]*)\}
$

Conclusions

I wondered why is grepping with ripgrep so much more performant then parsing I did and experimented with various simplified grammars - overall, grep-like usecases that relied on finding string literal (prefix/suffix) were exceptionally fast, while .* or similar kinds of character gobbling loops not so much. Even matching whole line with simple (?P<logline>.*) was taking substantial time IIRC.

I did try the regex DFA CLI debugging tool and some generated machines were suboptimal or at least far more complex than casual regex user expects, consider following simple examples

regex-debug --dfa compile '.*'
regex-debug --dfa compile '(?-u).*'
regex-debug --dfa --bytes compile '(?-u).*'

But even with optimized state machine generation, I somewhat doubt it'd make much difference unless there's a possibility for large performance gain in how loops are handled. I'm not familiar with regex internals, but profiling showed that vast majority of the time was spent in regex::backtrack::Bounded<I>::backtrack. Perhaps that meant my regex was very inefficient? I played with usage of (non)greedy version on all repetitions and it didn't made huge difference. Except the blanket repetition in the body element, actual backtracking shouldn't typically happen much. (Note that the files I worked on were always in valid format.)

Overall, I feel it's likely that getting notably better performance would require move from regexes to other parsing solution, but that would effectively kill this project, as easy configuration of the parsing was a key selling point.

In any case, I hope this writeup was helpful and I'm glad I had a change to talk a bit about this project.
That's all the data I have on hand, but if you'd be interested in something specific or see the code, feel free to let me know.

3 replies

BurntSushi Feb 25, 2023
Maintainer Author

Thank you for the write up! In the future, please feel free to reach out and ask questions on this tracker when you need help optimizing. I am really quite happy to help, assuming you can give me a repro. :)

And that's what I'd like to ask for here. What I really need is a way to reproduce the problems you hit. Is there anyway you can write a short Rust program and give a smallish example (perhaps sanitized) of your haystack? 30MB would be more than enough.

In particular, note the questions I asked. It isn't just about the regex, but the APIs you're using and the haystack.

If you can minimize the fundamental part of your perf problem down to a short program, then I can in exchange give you an analysis. Sand that analysis should hopefully tell you whether there are any opportunities for going faster, including by using a different regex engine.

OpatrilPeter Feb 26, 2023

Sorry for not posting more concrete details, I was aware that's what you were mainly after, but I didn't have time to dig for that at the time.

I've been using captures method, as I've needed to extract contents of named capture groups. In fact, as the names of the capture groups were dynamic, I had to work around the fact that AFAIK Captures type doesn't expose names of the capture groups, so I zipped iterator of captures with Regex::capture_names. I think I checked that the performance cost of this was negligible, at least compared to, say, precalculating Regex::capture_names result and storing it in some lookup structure.

The most useful area to explore from things I've mentioned IMO is that bytewise regex parser was notably slower that unicode one on the apparently same regex, so I've made a example of that.

Hope that's enough!

rs-scratch.zip

By the way, I wonder if there is a lot of wisdom/tips on how to write efficient regexes based on how this crate operates under the hood. If so, that could be a great addition to the documentation!
Of course, caveat here is that it could make contracts about these performance characteristics essentially part of public API.

BurntSushi Feb 26, 2023
Maintainer Author

Thanks! That's perfect. So you'll get a small boost with the internal refactoring here, but not much:

$ rebar cmp tmp/scratch.csv
benchmark                          rust/regex/meta     rust/regexold
---------                          ---------------     -------------
wild/unstructured-to-json/ascii    112.6 MB/s (1.00x)  89.9 MB/s (1.25x)
wild/unstructured-to-json/unicode  112.2 MB/s (1.00x)  109.3 MB/s (1.03x)

As for the difference between "bytewise" and not, this is actually a function of whether Unicode is enabled or not. So you could use bytes::Regex in both cases, but just toggle Unicode on and off. With Unicode, things like [^ ] are represented as codepoints, but without, they get represents as raw bytes. The bytes representation tends to come with higher overhead in the NFA engines (PikeVM, bounded backtracker). In my refactor, I've actually made it so everything always uses the bytes representation and got rid of the Unicode representation. But I also did some optimization work to reduce its overhead, which is likely what's helping here.

By the way, I wonder if there is a lot of wisdom/tips on how to write efficient regexes based on how this crate operates under the hood. If so, that could be a great addition to the documentation!

This already exists: https://github.com/rust-lang/regex/blob/master/PERFORMANCE.md

Otherwise there's really not much else to say here, especially for regexes like yours. Your use case isn't about how quickly matches can be found in a huge haystack, but rather, how quickly the engine can parse a short haystack into capture groups. That is, it's less about "where are the matches" and more about "split the haystack into pieces I understand." They are two very different things, and honestly, backtrackers tend to be better here:

$ rebar cmp tmp/results.csv -e '!go/regexp' -e '!^re2$' -e '!rust'
benchmark                          pcre2               pcre2/jit            python/re            python/regex         regress
---------                          -----               ---------            ---------            ------------         -------
wild/unstructured-to-json/ascii    212.5 MB/s (7.46x)  1585.2 MB/s (1.00x)  129.8 MB/s (12.21x)  136.9 MB/s (11.58x)  288.8 MB/s (5.49x)
wild/unstructured-to-json/unicode  191.6 MB/s (6.03x)  1155.4 MB/s (1.00x)  130.7 MB/s (8.84x)   135.7 MB/s (8.51x)   289.1 MB/s (4.00x)

Incidentally, in your case, the regex crate uses its own backtracker, although it does extra work to bound itself to avoid going exponential. I do wonder if there's some room to improve here by micro-optimizing the regex crate's backtracker, but I doubt it's ever going to reach pcre2/jit speeds, which are pretty incredible here.

But this is a great benchmark, thank you!

OpatrilPeter · 2023-02-28T03:36:00Z

OpatrilPeter
Feb 28, 2023

Awesome, glad I could help and thanks for additional clarifications.

This already exists: https://github.com/rust-lang/regex/blob/master/PERFORMANCE.md

Right, I recall now reading that in the past, so it's definitely discoverable. If I'll ever come across some info that would be a good fit there, I'll make an issue/PR.
Did I understand correctly that this crate's limited success with usage patterns like mine are mostly inherent to the core implementation choices? Perhaps then a broad outline of patterns that will likely work out better in backtracking engines could make a good candidate for addition into that perf document - then again, perhaps actively encouraging people to use competing regex library for what could be more often than not a microoptimization wouldn't be wise, regex is already used in countless projects that then benefit from dependency tree deduplication ... imagine if, say, couple of the widely used crates like clap or env_logger moved to PCRE and nearly everyone would have to live with binaries bloated by having two regex systems side by side.

The pcre2/jit results look very amazing indeed and picked my curiosity enough to try quickly plugging it in my application (using your pcre2 crate), but can't say I've seen pixie dust - the JITed code itself is likely fast, but there is a lot of additional allocations all over the place that overshadowed the potential gains for me. Unsure if it's a tax for the highlevel API, using 4 years old version of the PCRE library or some fumble on my side, but if I'd eventually get back to this, explore it further and find something worth sharing, I'll be sure to reach out.

2 replies

BurntSushi Feb 28, 2023
Maintainer Author

Perhaps then a broad outline of patterns that will likely work out better in backtracking engines could make a good candidate for addition into that perf document

I think this is a lot harder than you think it is unfortunately.

The best way to do this IMO is to publish a benchmark. That's what I'm trying to do, and that's what I started this entire discussion with. :-)

I am working on a regex benchmark (to be made public probably some time this year), and I'm looking to add regexes to it that are used in the real world.

Also, you say you tried PCRE... Do you mean PCRE2? When possible, don't describe what you did, show what you did. :) And yes, your description of there being extra allocs everywhere doesn't make sense to me. PCRE2 provides APIs to amortize allocs.

Also, have you tried this? https://docs.rs/regex/latest/regex/struct.Regex.html#method.captures_read

OpatrilPeter Mar 12, 2023

Finally got time to have a good look at this.

Also, you say you tried PCRE... Do you mean PCRE2?

Yes, as I've said, I tried your PCRE2 bindings 0.2.3 compared to regex 1.6.0.

In my previous post I used captures(), because I've 1) not seen allocation as a huge bottleneck in regex version 2) fancy-regex didn't have captures_read API, which would require special casing its support.

And yes, your description of there being extra allocs everywhere doesn't make sense to me

At least on relative scale seen in flamegraphs, allocation took more significant portion in the PCRE case, but I've mostly came to that conclusion by looking at data from incomparable workloads where that effect looked far more severe i.e. it's mostly comparable.

If you're interested, here are the flamegraphs on same tasks with same level of optimization, look mostly at captures under log2json::process_line_impl. Note that the JITed code is not included under there. (Zooming in is needed, which doesn't seem to work for me without downloading them.)

After that, I did move on to captures_read after that and can no longer see any allocations, so thanks for the suggestion.

However, the true reason the PCRE2 version wasn't as fast for me was an unrelated problem in my parallelization design. In short, as I increase number of threads, there is a quickly rising contention cost that soon beats the gains of parallelizing the task - and as workers using PCRE2 get their job done sooner, this magnifies the issue.

Some sample values of whole program time on identical workload illustrating this:

engine / thread count	1	2	4	6	10	14	16 (what I saw in previous post)	20
regex	181.848 s	92.699 s	48.010 s	32.898 s	20.528 s	20s	63.614 s	89s
pcre2+jit	70.941 s	36.618 s	20.803 s	14.699 s	24.792 s	91s	97.555 s	97s

The problem is likely in my use of crossbeam::channel and flamegraphs do seem to confirm that.

6 workers vs 20 workers:

(Sidenote: I used a work-stealing queue originally, but the performance was actually worse than a simple channel, but maybe there'd be room for improvement.)

Time measurements on regex itself, made on captures_read call

let before_regex =
	measurer.regex_started(part_path, regex_ix);
let whole_match = regex.captures_read(capture_locations, part)?;
measurer.regex_complete(before_regex);

show that PCRE2 is consistently very fast. There is a slowdown as number of threads goes up, possibly by contention on inner shared state (regexes are cloned into each worker).

Following table shows total time, execution count and average time per execution:

engine / thread count	1	6	14	20
regex	102.136449851s (12090663x, avg 8.447µs)	102.133379323s (12090663x, avg 8.447µs)	104.539665646s (12090663x, avg 8.645µs)	114.40951101s (12090663x, avg 9.461µs)
pcre2+jit	12.90345059s (12090663x, avg 1.066µs)	13.941074451s (12090663x, avg 1.152µs)	16.721257186s (12090663x, avg 1.382µs)	17.373337751s (12090663x, avg 1.436µs)

I also did try to recreate the regexes from scratch per worker to disprove it was affecting the major slowdown of the whole application and indeed it did not.

So there are no problems in the regex crate and PCRE is indeed faster for my usecase.
Hope this clears the situation!

Mark-Simulacrum · 2023-04-22T01:41:38Z

Mark-Simulacrum
Apr 22, 2023
Maintainer

I wanted to share a case where replacing a regex with some hand-crafted string searching proved to be a nearly ~2x win. It's not necessarily exactly the same result (didn't fuzz etc) but it's functionally equivalent. The program is rustfilt, which finds and demangles Rust symbols in arbitrary text (and prints that text to stdout/file).

Here is the regex in question: https://github.com/luser/rustfilt/blob/3c81f107b73fdcf7bbb98b426c1214f15946ec90/src/main.rs#L36: Regex::new(r"_(ZN|R)[\$\._[:alnum:]]*").

Here is the PRs (rust-lang/rustc-demangle#62, rust-lang/rustc-demangle#63) that added code to rustc-demangle which is in my measurements ~2x faster end-to-end, essentially replacing the regex crate's replace_all with a hand-coded routine which accomplishes the same result. As far as I can tell, the speedup is pretty much entirely down to avoiding some sort of backtracking in regex. (I have not investigated why there is backtracking, it's just what I see show up in flamegraphs of rustfilt. It's possible the regex in question has some subtle bug or replace_all is a poor choice here).

I've attached (rustc.script.zst.zip) a zstd-compressed (and then zip'd, so GitHub accepts it) ~980MB file (can likely be cut down, should be pretty uniform) that can be used to reproduce this. The file is the result of perf record and perf script on a compilation of libcore with rustc, so it's real world data (demangling that is common practice e.g. to produce a flamegraph).

$ time rustfilt -i ./Build/rust/rustc.script > /dev/null # mainline rustfilt, using the regex above
real    0m21.039s
user    0m20.360s
sys     0m0.672s
$ time ./Build/rustfilt/target/release/rustfilt -i ./Build/rust/rustc.script > /dev/null # custom implementation avoiding the regex
real    0m10.236s
user    0m9.742s
sys     0m0.492s

The rustfilt diff to produce the faster version can be found here: luser/rustfilt#21, in general it's just calling into the rustc_demangle::demangle_stream function though.

Let me know if this is helpful / I can provide more sample data, etc. (Or if you find bugs in the replacement for regex for this scenario :)

2 replies

riking Jul 5, 2023

Mangled symbols are typically ascii-only - can you try a (?-u) version?

BurntSushi Jul 5, 2023
Maintainer Author

Oh excellent! Thank your @Mark-Simulacrum, I had not see this report. Completely missed it. I'll see about digging into this.

(The regex crate does have a bounded backtracker---it still guarantees worst case O(m * n) time---but its use suggests you're extracting capture groups which is not a fast operation. But I'll investigate and figure out what's what. Thank you!)

Is the regex crate a bottleneck in your program? If so, can you share the details? #960

BurntSushi Feb 23, 2023 Maintainer

Replies: 8 comments · 25 replies

BurntSushi Feb 23, 2023 Maintainer Author

BurntSushi Feb 25, 2023 Maintainer Author

BurntSushi Feb 25, 2023 Maintainer Author

BurntSushi Feb 26, 2023 Maintainer Author

BurntSushi Feb 24, 2023 Maintainer Author

BurntSushi Feb 24, 2023 Maintainer Author

BurntSushi Apr 5, 2023 Maintainer Author

BurntSushi Feb 24, 2023 Maintainer Author

BurntSushi Feb 25, 2023 Maintainer Author

Brief background & motivation:

Takeaways

Conclusions

BurntSushi Feb 25, 2023 Maintainer Author

BurntSushi Feb 26, 2023 Maintainer Author

BurntSushi Feb 28, 2023 Maintainer Author

Mark-Simulacrum Apr 22, 2023 Maintainer

BurntSushi Jul 5, 2023 Maintainer Author

BurntSushi
Feb 23, 2023
Maintainer

Replies: 8 comments 25 replies

BurntSushi Feb 23, 2023
Maintainer Author

BurntSushi Feb 25, 2023
Maintainer Author

BurntSushi Feb 25, 2023
Maintainer Author

BurntSushi Feb 26, 2023
Maintainer Author

BurntSushi Feb 24, 2023
Maintainer Author

BurntSushi Feb 24, 2023
Maintainer Author

BurntSushi Apr 5, 2023
Maintainer Author

BurntSushi Feb 24, 2023
Maintainer Author

BurntSushi Feb 25, 2023
Maintainer Author

BurntSushi Feb 25, 2023
Maintainer Author

BurntSushi Feb 26, 2023
Maintainer Author

BurntSushi Feb 28, 2023
Maintainer Author

Mark-Simulacrum
Apr 22, 2023
Maintainer

BurntSushi Jul 5, 2023
Maintainer Author