performance expectations #136

victorjulien · 2023-11-28T17:09:46Z

Hi, let me start out by saying that I appreciate the work that you're doing here. Thanks a lot!

In Suricata [1], a free and open source network IDS/IPS, we heavily use multi pattern matching (mpm) to reduce the work we have to do per packet. We have traditionally used our own Aho-Corasick implementations for this [2][3] but nowadays most of our users will use the much faster Hyperscan library support.

Since the aho-corasick crate is already a transitive dependency for us, I thought it might be worth checking if we can also use it for our prefiltering mpm needs. The results have been a bit disappointing, with my best implementation using this crate being between 2.5 and 3 times slower than Hyperscan, and between 1.5 and 2 times slower than our own Aho-Corasick implementation (written in C) [2].

I guess I'm looking for some feedback on this. Do these numbers seem reasonable based on other feedback, or are they perhaps indicative of issues in my implementation? Since Suricata is mixed C and Rust we do have the FFI border to cross, but I would be surprised if that plays a big role based on our other experiences.

Perhaps our use case is just not very well supported. We use multiple pattern sets and the largest (and most commonly executed) ones exceed 5k to 10k patterns of very mixed quality and size, heavily using the full 8 bit alphabet.

Appreciate any feedback on this.

Oh the PR to hook aho-corasick code into Suricata's logic can be found here:
OISF/suricata@bcb8b0a#diff-afa227151ac29b6f3c5955da7506281ec1539addaccf440df81ce57f3279044cR117

Regards,
Victor

[1] https://github.com/OISF/suricata
[2] https://github.com/OISF/suricata/blob/master/src/util-mpm-ac.c#L935
[3] https://github.com/OISF/suricata/blob/master/src/util-mpm-ac-ks.c

BurntSushi · 2023-11-28T17:46:19Z

Maintainer

It's really hard to say without an easy reproduction at my fingertips that I can experiment with. There are so many variables that don't come through in a prose description that I can't say much with confidence.

At the number of patterns you're talking about, this library doesn't have any particularly clever tricks up its sleeve. I believe Hyperscan does, and will use a SIMD algorithm called FDR. It's on my list of things to learn more about and potentially port, but I haven't gotten around to it yet. This library does use SIMD, but only when the number of literals is much smaller (say around 100 or less). So at least, with respect to Hyperscan, I wouldn't expect this library to beat it. Hyperscan is really best in class here.

Otherwise, one thing that sticks out to me is that your C implementation appears to be a DFA, but you aren't configuring this library to use a DFA. You'll want to use AhoCorasickBuilder::kind for that.

I will say that the overlapping case tends to have less optimization work put into it, mostly because it's just not a mode that I use often. The leftmost-first match semantics have had the most effort put into them.

> Perhaps our use case is just not very well supported. We use multiple pattern sets and the largest (and most commonly executed) ones exceed 5k to 10k patterns of very mixed quality and size, heavily using the full 8 bit alphabet.

The number of patterns shouldn't be an issue for this library. (I regularly test it with millions of patterns.) This library has three different implementations of Aho-Corasick:

- A non-contiguous NFA that somewhat reflects the textbook description, but has some tricks to limit memory usage.
- A contiguous NFA that uses even less memory and has better locality. It strikes a very good balance between search speed and memory usage. This is likely what is being used based on the code you linked.
- A DFA. This has exorbitant memory usage, but has the best search speed. By default, a DFA can be used, but only when the number of patterns is pretty small.

So perhaps if you switch over to the DFA explicitly, the search speed of this library will improve a bit. But if you stick with the contiguous NFA, you might get slightly slower search speeds, but memory usage is likely substantially better. I don't know if that matters to you or not, but it's a benefit worth pointing out IMO.

BurntSushi Nov 28, 2023
Maintainer

> Setting kind to DFA seems to improve the performance considerably, although still not as much as I'd hope.

How does it compare with your C implementation? I would generally hope they're roughly on par. I wouldn't necessarily expect it to be faster than yours.

> I did not see a way to anchor patterns. Hyperscan offers a min_offset and max_offset (in suricata speak this is offset and depth) that we use. In our AC matcher we check that for each match, like I did in my code wrapping this crate. Did I overlook or misunderstand some anchoring capabilities?

See https://docs.rs/aho-corasick/latest/aho_corasick/struct.AhoCorasick.html#search-configuration and https://docs.rs/aho-corasick/latest/aho_corasick/struct.Input.html. The Input type is how you control where the search starts and ends, and also whether to run an anchored search. Note that you need to opt into anchored searches at construction time.

Otherwise though, if you have a slice haystack, then you can just do &haystack[start..end] to search a subsequence of it.

> I'm currently testing inside Suricata, which wouldn't be ideal for anyone to analyze it, but it does give us the performance insights that matter to us. What would be a way to provide a test case that would be useful to you?

Basically a sequence of commands with inputs that I can run on my own machine to profile things.

But if this library's DFA speed is roughly on par with your C library's DFA, then that sounds like what I'd expect. A DFA is just a DFA. There really isn't much to be done about it. To get faster than a DFA, you need to start doing tricks like what Hyperscan does.

victorjulien Nov 28, 2023
Author

I'm testing with 2 rulesets, "open" and "pro". "Pro" is a superset of "open".

For "open" I get the following results in cpu ticks per byte inspected:

hyperscan: 8
ac: 17
ac-ks: 19 (a variant of "ac" that tries to be more space efficient)
ac-rs: 23 (based on this crate)

For "pro":

hyperscan: 12
ac: 24
ac-ks: 25
ac-rs: 33

So "ac-rs" seems to be around 35% more expensive than "ac" in cpu ticks per byte.
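The "around 35%" figure can be checked directly from the ticks-per-byte numbers above. A small sketch of the arithmetic (the function name is made up for illustration):

```rust
/// Extra cost of `candidate` relative to `baseline`, in percent.
/// Both arguments are CPU ticks per byte inspected.
fn overhead_percent(baseline: f64, candidate: f64) -> f64 {
    (candidate / baseline - 1.0) * 100.0
}

/// How many times slower `candidate` is than `baseline`.
fn slowdown_factor(baseline: f64, candidate: f64) -> f64 {
    candidate / baseline
}
```

With the numbers reported here, `overhead_percent(17.0, 23.0)` is roughly 35.3 for "open" and `overhead_percent(24.0, 33.0)` is 37.5 for "pro". Against Hyperscan, `slowdown_factor(8.0, 23.0)` is about 2.9 and `slowdown_factor(12.0, 33.0)` is 2.75.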

Wrt the anchoring, I think I'm still not understanding the explanation. What we have in our case is that per pattern our rule language can specify anchoring, so it's a property of the pattern. E.g. something like "POST" should be the first 4 bytes of a http method buffer, or "\xffSMB" should start after 4 bytes into the data. Our mpm API allows specifying this per pattern and in hyperscan the API supports it, while in our AC implementation we explicitly check for it in the algo using:

                        const SCACPatternList *pat = &pid_pat_list[pids[k]];
                        const int offset = i - pat->patlen + 1;

                        if (offset < (int)pat->offset || (pat->depth && i > pat->depth))
                            continue;

Does this crate allow something similar to hyperscan? I did not see it, which is why I added it similar to the C code above, as:

        /* enforce offset and depth */
        if pattern.offset as usize > mat.start() {
            SCLogDebug!("pattern {:?} failed: found before offset", pat_id);
            continue;
        }
        if pattern.depth != 0 && mat.end() > pattern.depth as usize {
            SCLogDebug!("pattern {:?} failed: after depth", pat_id);
            continue;
        }

BurntSushi Nov 28, 2023
Maintainer

Oh, no, there's no support for pattern specific rules like that. That probably requires something beyond what Aho-Corasick itself supports. You could use regex-automata to create a regex for each pattern that has the desired semantics I imagine, but I wouldn't be surprised if perf was worse. And it scales a lot worse than Aho-Corasick.
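Given that the crate has no per-pattern constraints, filtering each reported match afterwards (as in the wrapper snippet earlier in the thread) is one workable approach. A self-contained sketch of that check follows; `PatternMeta` and `passes_offset_depth` are hypothetical names, and the semantics simply mirror the Rust snippet above, with a depth of 0 meaning "no limit".

```rust
/// Hypothetical per-pattern constraints, modelled on Suricata's
/// `offset` and `depth` keywords. A `depth` of 0 means "no limit".
struct PatternMeta {
    offset: usize,
    depth: usize,
}

/// Decide whether a match spanning `start..end` in the buffer satisfies
/// the constraints of the pattern that produced it.
fn passes_offset_depth(meta: &PatternMeta, start: usize, end: usize) -> bool {
    // The match must not begin before the pattern's offset.
    if start < meta.offset {
        return false;
    }
    // When a depth is set, the match must end within it.
    if meta.depth != 0 && end > meta.depth {
        return false;
    }
    true
}
```

For example, a "POST" pattern with `offset: 0, depth: 4` accepts a match at `0..4` and rejects one at `5..9`.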

As for your timings, if you are able to create a simple reproduction for me to try, I can at least say that I'll take a look at profiling it and see if there's anything I can do on my end to make the search faster. Notice, for example, this comment in the core search loop:

aho-corasick/src/automaton.rs, lines 1311 to 1319 in f227162:

        // I've tried unrolling this loop and eliding bounds checks, but no
        // matter what I did, I could not observe a consistent improvement on
        // any benchmark I could devise. (If someone wants to re-litigate this,
        // the way to do it is to add an 'next_state_unchecked' method to the
        // 'Automaton' trait with a default impl that uses 'next_state'. Then
        // use 'aut.next_state_unchecked' here and implement it on DFA using
        // unchecked slice index acces.)
        sid = aut.next_state(anchored, sid, input.haystack()[at]);
        if aut.is_special(sid) {

Your C code doesn't appear to have any bounds checks for example. It might be worth experimenting there to see if it helps in your benchmark.
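To make the bounds-check experiment concrete, here is a toy dense DFA written with only the standard library. It is not the crate's implementation: `ToyDfa`, its table layout, and the single hard-coded pattern "ab" are all assumptions for illustration. It shows a checked transition next to an unchecked one, which is the shape of the change the code comment describes. Whether the unchecked variant is measurably faster would have to be benchmarked.

```rust
/// Number of transitions per state: one per possible byte value.
const STRIDE: usize = 256;

/// Toy dense DFA for the single pattern "ab".
/// State 0 = start, 1 = saw 'a', 2 = saw "ab" (match state).
struct ToyDfa {
    trans: Vec<u32>,
}

impl ToyDfa {
    fn for_ab() -> ToyDfa {
        // Every transition defaults to the start state.
        let mut trans = vec![0u32; 3 * STRIDE];
        // From any state, 'a' moves to state 1.
        for sid in 0..3 {
            trans[sid * STRIDE + b'a' as usize] = 1;
        }
        // From state 1, 'b' completes the match.
        trans[STRIDE + b'b' as usize] = 2;
        ToyDfa { trans }
    }

    /// Bounds-checked transition.
    fn next_state(&self, sid: u32, byte: u8) -> u32 {
        self.trans[sid as usize * STRIDE + byte as usize]
    }

    /// Transition without a bounds check.
    ///
    /// # Safety
    /// `sid` must be a valid state, i.e. less than 3.
    unsafe fn next_state_unchecked(&self, sid: u32, byte: u8) -> u32 {
        unsafe { *self.trans.get_unchecked(sid as usize * STRIDE + byte as usize) }
    }
}

fn count_matches_checked(dfa: &ToyDfa, haystack: &[u8]) -> usize {
    let (mut sid, mut count) = (0u32, 0usize);
    for &b in haystack {
        sid = dfa.next_state(sid, b);
        if sid == 2 {
            count += 1;
        }
    }
    count
}

fn count_matches_unchecked(dfa: &ToyDfa, haystack: &[u8]) -> usize {
    let (mut sid, mut count) = (0u32, 0usize);
    for &b in haystack {
        // SAFETY: the table only ever stores states 0, 1 or 2, and it
        // holds 3 * STRIDE entries, so the computed index is in bounds.
        sid = unsafe { dfa.next_state_unchecked(sid, b) };
        if sid == 2 {
            count += 1;
        }
    }
    count
}
```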

Another area where this crate will probably do worse than a bespoke implementation is match latency. Whenever you call a search routine on AhoCorasick, it has to go through a virtual function call and probably some other stuff because of the abstractions I built up. But you can always drop down and use the DFA directly with the Automaton trait. It's a little more code to write, but a lot less than a full Aho-Corasick implementation.

victorjulien Nov 28, 2023
Author

It seems the latency or start-up cost is not that bad, actually. We inspect several different data buffer types, and there is one that averages 3 bytes. Here my crate wrapper code performs better than Hyperscan, although worse than the C Aho-Corasick variants, which suggests that Hyperscan has a higher start-up cost. OTOH, the C implementations being faster here might explain at least part of that 35% gap.

victorjulien Nov 29, 2023
Author

I've tried to directly call the DFA logic here OISF/suricata@3a6eb43, is this what you had in mind?

It gives a few percent better numbers, but not enough to bridge the gap.
