regrams is a go package that implements an engine for text search by regular expression queries. Given a collection of documents that are indexed by trigrams and a regular expression, regrams can generate a boolean query for the trigram index that retrieves good candidates that might match the regular expression.
For example, regrams.MakeQuery("a(bc)+d")
returns (abc) (bcb|bcd)
, which
means that any document that matches the regular expression a(bc)+d
must
contain the trigram abc
as well as one of bcb
or bcd
.
regrams is inspired by Google's codesearch repo and Russ Cox's corresponding blog post explaining the implementation. Instead of using the structure of the parsed regular expression to generate queries, however, regrams creates an NFA from the query with nodes weighted by the size of the reachable trigram sets and repeatedly extracts disjunctions from the NFA by identifying and removing minimum-weight graph cuts. I wrote a post on regular expression search via graph cuts that gives a lot more detail about the technique.
The easiest way to try out this package is to run the command-line wrapper included in this repo:
$ go run cmd/regrams/main.go abcd
(abc) (bcd)
$ go run cmd/regrams/main.go 'Time: [0-9]+ ms'
( ms) (: 0|: 1|: 2|: 3|: 4|: 5|: 6|: 7|: 8|: 9) (Tim) (e: ) (ime) (me:)
$ go run cmd/regrams/main.go '(abc*)+de'
(aba|abc|abd) (bab|bca|bcc|bcd|bde)
$ go run cmd/regrams/main.go '(?i)abc'
(ABC|ABc|AbC|Abc|aBC|aBc|abC|abc)
Running go run cmd/regrams/main.go --help
will give you more options.
Alternatively, you can try out examples in a browser at the Go Playground.
To use regrams as a package, import github.com/aaw/regrams
, then call
regrams.MakeQuery
in your code to generate a trigram query from a string
representing a regular expression. MakeQuery
returns a slice-of-slices of
strings, which should be interpreted as a boolean query in conjunctive
normal form: the slice [["abc"], ["xyz","xyw"], ["xxx", "yyy", "zzz"]]
represents the query abc AND (xyz OR xyw) AND (xxx OR yyy OR zzz)
.