This document describes the internal design of this crate, which is an object lesson in what happens when you take a fairly simple old algorithm like Aho-Corasick and make it fast and production ready.
The target audience of this document is Rust programmers that have some familiarity with string searching, however, one does not need to know the Aho-Corasick algorithm in order to read this (it is explained below). One should, however, know what a trie is. (If you don't, go read its Wikipedia article.)
The center-piece of this crate is an implementation of Aho-Corasick. On its own, Aho-Corasick isn't that complicated. The complex pieces come from the different variants of Aho-Corasick implemented in this crate. Specifically, they are:
- Aho-Corasick as a noncontiguous NFA. States have their transitions represented sparsely, and each state puts its transitions in its own separate allocation. Hence the same "noncontiguous."
- Aho-Corasick as a contiguous NFA. This NFA uses a single allocation to represent the transitions of all states. That is, transitions are laid out contiguously in memory. Moreover, states near the starting state are represented densely, such that finding the next state ID takes a constant number of instructions.
- Aho-Corasick as a DFA. In this case, all states are represented densely in a transition table that uses one allocation.
- Supporting "standard" match semantics, along with its overlapping variant, in addition to leftmost-first and leftmost-longest semantics. The "standard" semantics are typically what you see in a textbook description of Aho-Corasick. However, Aho-Corasick is also useful as an optimization in regex engines, which often use leftmost-first or leftmost-longest semantics. Thus, it is useful to implement those semantics here. The "standard" and "leftmost" search algorithms are subtly different, and also require slightly different construction algorithms.
- Support for ASCII case insensitive matching.
- Support for accelerating searches when the patterns all start with a small
number of fixed bytes. Or alternatively, when the patterns all contain a
small number of rare bytes. (Searching for these bytes uses SIMD vectorized
code courtesy of
memchr
.) - Transparent support for alternative SIMD vectorized search routines for smaller number of literals, such as the Teddy algorithm. We called these "packed" search routines because they use SIMD. They can often be an order of magnitude faster than just Aho-Corasick, but don't scale as well.
- Support for searching streams. This can reuse most of the underlying code, but does require careful buffering support.
- Support for anchored searches, which permit efficient "is prefix" checks for a large number of patterns.
When you combine all of this together along with trying to make everything as
fast as possible, what you end up with is enitrely too much code with too much
unsafe
. Alas, I was not smart enough to figure out how to reduce it. Instead,
we will explain it.
The fundamental problem this crate is trying to solve is to determine the occurrences of possibly many patterns in a haystack. The naive way to solve this is to look for a match for each pattern at each position in the haystack:
for i in 0..haystack.len():
for p in patterns.iter():
if haystack[i..].starts_with(p.bytes()):
return Match(p.id(), i, i + p.bytes().len())
Those four lines are effectively all this crate does. The problem with those four lines is that they are very slow, especially when you're searching for a large number of patterns.
While there are many different algorithms available to solve this, a popular one is Aho-Corasick. It's a common solution because it's not too hard to implement, scales quite well even when searching for thousands of patterns and is generally pretty fast. Aho-Corasick does well here because, regardless of the number of patterns you're searching for, it always visits each byte in the haystack exactly once. This means, generally speaking, adding more patterns to an Aho-Corasick automaton does not make it slower. (Strictly speaking, however, this is not true, since a larger automaton will make less effective use of the CPU's cache.)
Aho-Corasick can be succinctly described as a trie with state transitions between some of the nodes that efficiently instruct the search algorithm to try matching alternative keys in the trie. The trick is that these state transitions are arranged such that each byte of input needs to be inspected only once. These state transitions are typically called "failure transitions," because they instruct the searcher (the thing traversing the automaton while reading from the haystack) what to do when a byte in the haystack does not correspond to a valid transition in the current state of the trie.
More formally, a failure transition points to a state in the automaton that may lead to a match whose prefix is a proper suffix of the path traversed through the trie so far. (If no such proper suffix exists, then the failure transition points back to the start state of the trie, effectively restarting the search.) This is perhaps simpler to explain pictorally. For example, let's say we built an Aho-Corasick automaton with the following patterns: 'abcd' and 'cef'. The trie looks like this:
a - S1 - b - S2 - c - S3 - d - S4*
/
S0 - c - S5 - e - S6 - f - S7*
where states marked with a *
are match states (meaning, the search algorithm
should stop and report a match to the caller).
So given this trie, it should be somewhat straight-forward to see how it can
be used to determine whether any particular haystack starts with either
abcd
or cef
. It's easy to express this in code:
fn has_prefix(trie: &Trie, haystack: &[u8]) -> bool {
let mut state_id = trie.start();
// If the empty pattern is in trie, then state_id is a match state.
if trie.is_match(state_id) {
return true;
}
for (i, &b) in haystack.iter().enumerate() {
state_id = match trie.next_state(state_id, b) {
Some(id) => id,
// If there was no transition for this state and byte, then we know
// the haystack does not start with one of the patterns in our trie.
None => return false,
};
if trie.is_match(state_id) {
return true;
}
}
false
}
And that's pretty much it. All we do is move through the trie starting with the bytes at the beginning of the haystack. If we find ourselves in a position where we can't move, or if we've looked through the entire haystack without seeing a match state, then we know the haystack does not start with any of the patterns in the trie.
The meat of the Aho-Corasick algorithm is in how we add failure transitions to our trie to keep searching efficient. Specifically, it permits us to not only check whether a haystack starts with any one of a number of patterns, but rather, whether the haystack contains any of a number of patterns anywhere in the haystack.
As mentioned before, failure transitions connect a proper suffix of the path
traversed through the trie before, with a path that leads to a match that has a
prefix corresponding to that proper suffix. So in our case, for patterns abcd
and cef
, with a haystack abcef
, we want to transition to state S5
(from
the diagram above) from S3
upon seeing that the byte following c
is not
d
. Namely, the proper suffix in this example is c
, which is a prefix of
cef
. So the modified diagram looks like this:
a - S1 - b - S2 - c - S3 - d - S4*
/ /
/ ----------------
/ /
S0 - c - S5 - e - S6 - f - S7*
One thing that isn't shown in this diagram is that all states have a failure
transition, but only S3
has a non-trivial failure transition. That is, all
other states have a failure transition back to the start state. So if our
haystack was abzabcd
, then the searcher would transition back to S0
after
seeing z
, which effectively restarts the search. (Because there is no pattern
in our trie that has a prefix of bz
or z
.)
The code for traversing this automaton or finite state machine (it is no
longer just a trie) is not that much different from the has_prefix
code
above:
fn contains(fsm: &FiniteStateMachine, haystack: &[u8]) -> bool {
let mut state_id = fsm.start();
// If the empty pattern is in fsm, then state_id is a match state.
if fsm.is_match(state_id) {
return true;
}
for (i, &b) in haystack.iter().enumerate() {
// While the diagram above doesn't show this, we may wind up needing
// to follow multiple failure transitions before we land on a state
// in which we can advance. Therefore, when searching for the next
// state, we need to loop until we don't see a failure transition.
//
// This loop terminates because the start state has no empty
// transitions. Every transition from the start state either points to
// another state, or loops back to the start state.
loop {
match fsm.next_state(state_id, b) {
Some(id) => {
state_id = id;
break;
}
// Unlike our code above, if there was no transition for this
// state, then we don't quit. Instead, we look for this state's
// failure transition and follow that instead.
None => {
state_id = fsm.next_fail_state(state_id);
}
};
}
if fsm.is_match(state_id) {
return true;
}
}
false
}
Other than the complication around traversing failure transitions, this code is still roughly "traverse the automaton with bytes from the haystack, and quit when a match is seen."
And that concludes our section on the basics. While we didn't go deep into how
the automaton is built (see src/nfa/noncontiguous.rs
, which has detailed
comments about that), the basic structure of Aho-Corasick should be reasonably
clear.
There are generally two types of finite automata: non-deterministic finite automata (NFA) and deterministic finite automata (DFA). The difference between them is, principally, that an NFA can be in multiple states at once. This is typically accomplished by things called epsilon transitions, where one could move to a new state without consuming any bytes from the input. (The other mechanism by which NFAs can be in more than one state is where the same byte in a particular state transitions to multiple distinct states.) In contrast, a DFA can only ever be in one state at a time. A DFA has no epsilon transitions, and for any given state, a byte transitions to at most one other state.
By this formulation, the Aho-Corasick automaton described in the previous
section is an NFA. This is because failure transitions are, effectively,
epsilon transitions. That is, whenever the automaton is in state S
, it is
actually in the set of states that are reachable by recursively following
failure transitions from S
until you reach the start state. (This means
that, for example, the start state is always active since the start state is
reachable via failure transitions from any state in the automaton.)
NFAs have a lot of nice properties. They tend to be easier to construct, and also tend to use less memory. However, their primary downside is that they are typically slower to execute a search with. For example, the code above showing how to search with an Aho-Corasick automaton needs to potentially iterate through many failure transitions for every byte of input. While this is a fairly small amount of overhead, this can add up, especially if the automaton has a lot of overlapping patterns with a lot of failure transitions.
A DFA's search code, by contrast, looks like this:
fn contains(dfa: &DFA, haystack: &[u8]) -> bool {
let mut state_id = dfa.start();
// If the empty pattern is in dfa, then state_id is a match state.
if dfa.is_match(state_id) {
return true;
}
for (i, &b) in haystack.iter().enumerate() {
// An Aho-Corasick DFA *never* has a missing state that requires
// failure transitions to be followed. One byte of input advances the
// automaton by one state. Always.
state_id = dfa.next_state(state_id, b);
if dfa.is_match(state_id) {
return true;
}
}
false
}
The search logic here is much simpler than for the NFA, and this tends to translate into significant performance benefits as well, since there's a lot less work being done for each byte in the haystack. How is this accomplished? It's done by pre-following all failure transitions for all states for all bytes in the alphabet, and then building a single state transition table. Building this DFA can be much more costly than building the NFA, and use much more memory, but the better performance can be worth it.
Users of this crate can actually choose between using one of two possible NFAs
(noncontiguous or contiguous) or a DFA. By default, a contiguous NFA is used,
in most circumstances, but if the number of patterns is small enough a DFA will
be used. A contiguous NFA is chosen because it uses orders of magnitude less
memory than a DFA, takes only a little longer to build than a noncontiguous
NFA and usually gets pretty close to the search speed of a DFA. (Callers can
override this automatic selection via the AhoCorasickBuilder::start_kind
configuration.)
As described in the previous section, one of the downsides of using a DFA is that it uses more memory and can take longer to build. One small way of mitigating these concerns is to map the alphabet used by the automaton into a smaller space. Typically, the alphabet of a DFA has 256 elements in it: one element for each possible value that fits into a byte. However, in many cases, one does not need the full alphabet. For example, if all patterns in an Aho-Corasick automaton are ASCII letters, then this only uses up 52 distinct bytes. As far as the automaton is concerned, the rest of the 204 bytes are indistinguishable from one another: they will never disrciminate between a match or a non-match. Therefore, in cases like that, the alphabet can be shrunk to just 53 elements. One for each ASCII letter, and then another to serve as a placeholder for every other unused byte.
In practice, this library doesn't quite compute the optimal set of equivalence classes, but it's close enough in most cases. The key idea is that this then allows the transition table for the DFA to be potentially much smaller. The downside of doing this, however, is that since the transition table is defined in terms of this smaller alphabet space, every byte in the haystack must be re-mapped to this smaller space. This requires an additional 256-byte table. In practice, this can lead to a small search time hit, but it can be difficult to measure. Moreover, it can sometimes lead to faster search times for bigger automata, since it could be difference between more parts of the automaton staying in the CPU cache or not.
One other trick for DFAs employed by this crate is the notion of premultiplying state identifiers. Specifically, the normal way to compute the next transition in a DFA is via the following (assuming that the transition table is laid out sequentially in memory, in row-major order, where the rows are states):
next_state_id = dfa.transitions[current_state_id * 256 + current_byte]
However, since the value 256
is a fixed constant, we can actually premultiply
the state identifiers in the table when we build the table initially. Then, the
next transition computation simply becomes:
next_state_id = dfa.transitions[current_state_id + current_byte]
This doesn't seem like much, but when this is being executed for every byte of input that you're searching, saving that extra multiplication instruction can add up.
The same optimization works even when equivalence classes are enabled, as described above. The only difference is that the premultiplication is by the total number of equivalence classes instead of 256.
There isn't much downside to premultiplying state identifiers, other than it imposes a smaller limit on the total number of states in the DFA. Namely, with premultiplied state identifiers, you run out of room in your state identifier representation more rapidly than if the identifiers are just state indices.
Both equivalence classes and premultiplication are always enabled. There is a
AhoCorasickBuilder::byte_classes
configuration, but disabling this just makes
it so there are always 256 equivalence classes, i.e., every class corresponds
to precisely one byte. When it's disabled, the equivalence class map itself is
still used. The purpose of disabling it is when one is debugging the underlying
automaton. It can be easier to comprehend when it uses actual byte values for
its transitions instead of equivalence classes.
One of the more interesting things about this implementation of Aho-Corasick that (as far as this author knows) separates it from other implementations, is that it natively supports leftmost-first and leftmost-longest match semantics. Briefly, match semantics refer to the decision procedure by which searching will disambiguate matches when there are multiple to choose from:
- standard match semantics emits matches as soon as they are detected by the automaton. This is typically equivalent to the textbook non-overlapping formulation of Aho-Corasick.
- leftmost-first match semantics means that 1) the next match is the match starting at the leftmost position and 2) among multiple matches starting at the same leftmost position, the match corresponding to the pattern provided first by the caller is reported.
- leftmost-longest is like leftmost-first, except when there are multiple matches starting at the same leftmost position, the pattern corresponding to the longest match is returned.
(The crate API documentation discusses these differences, with examples, in
more depth on the MatchKind
type.)
The reason why supporting these match semantics is important is because it gives the user more control over the match procedure. For example, leftmost-first permits users to implement match priority by simply putting the higher priority patterns first. Leftmost-longest, on the other hand, permits finding the longest possible match, which might be useful when trying to find words matching a dictionary. Additionally, regex engines often want to use Aho-Corasick as an optimization when searching for an alternation of literals. In order to preserve correct match semantics, regex engines typically can't use the standard textbook definition directly, since regex engines will implement either leftmost-first (Perl-like) or leftmost-longest (POSIX) match semantics.
Supporting leftmost semantics requires a couple key changes:
- Constructing the Aho-Corasick automaton changes a bit in both how the trie is
constructed and how failure transitions are found. Namely, only a subset
of the failure transitions are added. Specifically, only the failure
transitions that either do not occur after a match or do occur after a match
but preserve that match are kept. (More details on this can be found in
src/nfa/noncontiguous.rs
.) - The search algorithm changes slightly. Since we are looking for the leftmost match, we cannot quit as soon as a match is detected. Instead, after a match is detected, we must keep searching until either the end of the input or until a dead state is seen. (Dead states are not used for standard match semantics. Dead states mean that searching should stop after a match has been found.)
Most other implementations of Aho-Corasick do support leftmost match semantics, but they do it with more overhead at search time, or even worse, with a queue of matches and sophisticated hijinks to disambiguate the matches. While our construction algorithm becomes a bit more complicated, the correct match semantics fall out from the structure of the automaton itself.
One of the nice properties of an Aho-Corasick automaton is that it can report all possible matches, even when they overlap with one another. In this mode, the match semantics don't matter, since all possible matches are reported. Overlapping searches work just like regular searches, except the state identifier at which the previous search left off is carried over to the next search, so that it can pick up where it left off. If there are additional matches at that state, then they are reported before resuming the search.
Enabling leftmost-first or leftmost-longest match semantics causes the automaton to use a subset of all failure transitions, which means that overlapping searches cannot be used. Therefore, if leftmost match semantics are used, attempting to do an overlapping search will return an error (or panic when using the infallible APIs). Thus, to get overlapping searches, the caller must use the default standard match semantics. This behavior was chosen because there are only two alternatives, which were deemed worse:
- Compile two automatons internally, one for standard semantics and one for the semantics requested by the caller (if not standard).
- Create a new type, distinct from the
AhoCorasick
type, which has different capabilities based on the configuration options.
The first is untenable because of the amount of memory used by the automaton. The second increases the complexity of the API too much by adding too many types that do similar things. It is conceptually much simpler to keep all searching isolated to a single type.
Since Aho-Corasick is an automaton, it is possible to do partial searches on partial parts of the haystack, and then resume that search on subsequent pieces of the haystack. This is useful when the haystack you're trying to search is not stored contiguously in memory, or if one does not want to read the entire haystack into memory at once.
Currently, only standard semantics are supported for stream searching. This is some of the more complicated code in this crate, and is something I would very much like to improve. In particular, it currently has the restriction that it must buffer at least enough of the haystack in memory in order to fit the longest possible match. The difficulty in getting stream searching right is that the implementation choices (such as the buffer size) often impact what the API looks like and what it's allowed to do.
In some cases, Aho-Corasick is not the fastest way to find matches containing multiple patterns. Sometimes, the search can be accelerated using highly optimized SIMD routines. For example, consider searching the following patterns:
Sherlock
Moriarty
Watson
It is plausible that it would be much faster to quickly look for occurrences of
the leading bytes, S
, M
or W
, before trying to start searching via the
automaton. Indeed, this is exactly what this crate will do.
When there are more than three distinct starting bytes, then this crate will look for three distinct bytes occurring at any position in the patterns, while preferring bytes that are heuristically determined to be rare over others. For example:
Abuzz
Sanchez
Vasquez
Topaz
Waltz
Here, we have more than 3 distinct starting bytes, but all of the patterns
contain z
, which is typically a rare byte. In this case, the prefilter will
scan for z
, back up a bit, and then execute the Aho-Corasick automaton.
If all of that fails, then a packed multiple substring algorithm will be
attempted. Currently, the only algorithm available for this is Teddy, but more
may be added in the future. Teddy is unlike the above prefilters in that it
confirms its own matches, so when Teddy is active, it might not be necessary
for Aho-Corasick to run at all. However, the current Teddy implementation
only works in x86_64
when SSSE3 or AVX2 are available or in aarch64
(using NEON), and moreover, only works well when there are a small number
of patterns (say, less than 100). Teddy also requires the haystack to be of a
certain length (more than 16-34 bytes). When the haystack is shorter than that,
Rabin-Karp is used instead. (See src/packed/rabinkarp.rs
.)
There is a more thorough description of Teddy at
src/packed/teddy/README.md
.