multiregex

Quickly match many regexes against a string. Provides 2-10x speedups over naïve regex matching.

Introduction

Installation

This project is managed by pixi. You can install the package in development mode using:

git clone https://github.com/quantco/multiregex
cd multiregex

pixi run pre-commit-install
pixi run postinstall
pixi run test

Usage

import multiregex

# Create matcher from multiple regexes.
my_patterns = [r"\w+@\w+\.com", r"\w+\.com"]
matcher = multiregex.RegexMatcher(my_patterns)

# Run `re.search` for all regexes.
# Returns a list of matches as (re.Pattern, re.Match) tuples.
matcher.search("[email protected]")
# => [(re.compile('\\w+@\\w+\\.com'), <re.Match ... '[email protected]'>),
#     (re.compile('\\w+\\.com'), <re.Match ... 'example.com'>)]

# Same as above, but with `re.match`.
matcher.match(...)
# Same as above, but with `re.fullmatch`.
matcher.fullmatch(...)

Custom prematchers

To be able to quickly match many regexes against a string, multiregex uses "prematchers" under the hood. Prematchers are lists of non-regex strings of which at least one can be assumed to be present in the haystack if the corresponding regex matches. As an example, a valid prematcher of r"\w+\.com" could be [".com"] and a valid prematcher of r"(B|b)aNäNa" could be ["b"] or ["anäna"]. Note that prematchers must be all-lowercase (in order for multiregex to be able to support re.IGNORECASE).
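
The idea can be illustrated with a minimal sketch that uses only the standard library. It is not multiregex's actual implementation; the function name `search_with_prematchers` and the simple substring scan are assumptions made for illustration. The regex is only run when at least one of its lowercase prematchers occurs in the lowercased haystack.

```python
import re


def search_with_prematchers(patterns, haystack):
    """Naive prematching sketch (hypothetical helper, not the library's code).

    `patterns` is a list of (compiled regex, list of lowercase strings).
    """
    lowered = haystack.lower()
    results = []
    for pattern, prematchers in patterns:
        # Skip the (slow) regex when none of its prematchers is present.
        # An empty prematcher list means "always run the regex".
        if prematchers and not any(p in lowered for p in prematchers):
            continue
        match = pattern.search(haystack)
        if match is not None:
            results.append((pattern, match))
    return results


patterns = [
    (re.compile(r"\w+\.com"), [".com"]),
    (re.compile(r"(B|b)aNäNa"), ["anäna"]),
]
hits = search_with_prematchers(patterns, "visit example.com today")
print([m.group(0) for _, m in hits])  # ['example.com']
```

In this run the second regex is never evaluated, because "anäna" does not occur in the haystack.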

You will likely have to provide your own prematchers for all but the simplest regex patterns:

multiregex.RegexMatcher([r"\d+"])
# => ValueError: Could not generate prematcher : '\\d+'

To provide custom prematchers, pass (pattern, prematchers) tuples:

multiregex.RegexMatcher([(r"\d+", map(str, range(10)))])

To mix automatic and custom prematchers, pass None as the prematchers of every pattern that should get automatically generated ones:

matcher = multiregex.RegexMatcher([(r"\d+", map(str, range(10))), (r"\w+\.com", None)])
matcher.prematchers
# => {(re.compile('\\d+'), {'0', '1', '2', '3', '4', '5', '6', '7', '8', '9'}),
#     (re.compile('\\w+\\.com'), {'com'})}
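
Before passing custom prematchers, it can help to sanity-check them against strings that the regex is known to match. The following is a sketch using only the standard library; `check_prematchers` is a hypothetical helper, not part of multiregex. It returns every sample that the regex matches although none of the prematchers occurs in it, which means the prematchers are invalid for that sample.

```python
import re


def check_prematchers(pattern, prematchers, samples, flags=0):
    """Return the samples that violate the prematcher contract (hypothetical helper)."""
    prematchers = list(prematchers)
    # Prematchers are compared against the lowercased haystack.
    if any(p != p.lower() for p in prematchers):
        raise ValueError("prematchers must be all-lowercase")
    regex = re.compile(pattern, flags)
    return [
        s
        for s in samples
        if regex.search(s) and not any(p in s.lower() for p in prematchers)
    ]


samples = ["abc 123", "no digits", "7"]
print(check_prematchers(r"\d+", map(str, range(10)), samples))  # []
print(check_prematchers(r"\d+", ["1", "2"], samples))           # ['7']
```

An empty result only shows that the prematchers hold for the given samples; it does not prove that they are valid for every possible input.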

Disabling prematchers

To disable prematching for a certain pattern entirely (i.e., always run the regex without first running any prematchers), pass an empty list of prematchers:

multiregex.RegexMatcher([(r"super complicated regex", [])])

Profiling prematchers

To check if your prematchers are effective, you can use the built-in prematcher "profiler":

yyyy_mm_dd = r"(19|20)\d\d-\d\d-\d\d"  # Default prematchers: {'-'}
matcher = multiregex.RegexMatcher([yyyy_mm_dd], count_prematcher_false_positives=True)
for string in my_benchmark_dataset:
    matcher.search(string)
print(matcher.format_prematcher_false_positives())
# => For example:
# FP count | FP rate | Pattern / Prematchers
# ---------+---------+----------------------
#      137 |    0.72 | (19|20)\d\d-\d\d-\d\d / {'-'}

In this example, there were 137 input strings that were matched positive by the prematcher but negative by the regex. In other words, the prematcher failed to prevent slow regex evaluation in 72% of the cases.
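
To make the two columns concrete, here is a small standard-library sketch that counts prematcher false positives by hand. The helper `prematcher_false_positives` is hypothetical, and dividing by the total number of input strings is an assumption about how the rate is defined; the library's own bookkeeping may differ.

```python
import re


def prematcher_false_positives(pattern, prematchers, strings):
    """Count strings that pass the prematchers but fail the regex (hypothetical helper)."""
    regex = re.compile(pattern)
    false_positives = 0
    for s in strings:
        lowered = s.lower()
        prematched = any(p in lowered for p in prematchers)
        # A false positive: the prematcher let the string through,
        # but the regex itself did not match.
        if prematched and regex.search(s) is None:
            false_positives += 1
    # Assumption: the rate is relative to all input strings.
    rate = false_positives / len(strings) if strings else 0.0
    return false_positives, rate


sample = ["born 1999-12-31", "well-known fact", "no dash here", "a - b"]
count, rate = prematcher_false_positives(r"(19|20)\d\d-\d\d-\d\d", ["-"], sample)
print(count, rate)  # 2 0.5
```

A high rate for a pattern suggests that a more specific prematcher is worth the effort.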
