performance comparison vs std/re, std/nre #57

timotheecour · 2020-02-29T00:11:41Z

It would be really nice to have a benchmark comparing performance wrt std/re, std/nre; I haven't seen any.
This would also be useful for performance regression testing.
This could start small and become more comprehensive as needed.

links

https://forum.nim-lang.org/t/6008#37206

nitely · 2020-02-29T02:21:46Z

IIRC it's 40 times slower. Someone reported it to be 5 times slower. It really depends on the regex, PCRE has exponential time complexity, and while those cases are kind of rare, it's easy to build a regex that runs in quadratic time (and/or does a lot of backtracking), so nim-regex is likely faster in such cases.

I'm working on making it faster (as nregex when there are capture groups, i.e: 4 times slower than PCRE), but it'll take me a little while to get it done.

timotheecour · 2020-02-29T03:19:27Z

oh yikes; maybe 1st step to improve this would be adding some benchmark code somewhere and keep track of performance... if you have already one maybe link it here?

I suggest using this: https://github.com/mariomka/regex-benchmark

any ideas where the bottlnecks, and whether this is an implementation issue or if the design itself is the limiting factor compared to PCRE?

D's std.regex

I read a few times that D has/had the fastest regex engine (https://dlang.org/library/std/regex.html), but I can't find a recent benchmark on this:
https://forum.dlang.org/post/[email protected]

It's kinda sad because std.regex used to be touted as the fastest regex engine even compared to other regex libraries. But it seems others have caught up since while we've stagnated. :-/

it seems a bit in contradiction with numbers from https://github.com/mariomka/regex-benchmark ; /cc @DmitryOlshansky do you have any insights / links?

If it is one of the fastest, may be good to look into its design given some similarities that D and nim have (in particular, both being able to do CTFE, which is relevant for this discussion, since std.regex uses that for bytecode (relevant even if we're in the end talking about running regex at RT))

here's his (dated) blog post: https://dlang.org/blog/2016/11/07/big-performance-improvement-for-std-regex/ which has interesting insights

alternative approach: better PCRE wrapper

I'm wondering whether we can as an alternative wrap PCRE but using a better API than nim's std/re, std/nre, which have a number of issues.
as for running regex at CT, I tried re and nre with -d:nimHasLibFFI; it "almost works" but needs small (maybe) language change to relax when casting in VM is allowed; so we could have high performance using both RT and CT

nitely · 2020-02-29T06:03:58Z

any ideas where the bottlnecks, and whether this is an implementation issue or if the design itself is the limiting factor compared to PCRE?

It's a design issue, AFAIK. nim-regex is based on Thompson's construction and multiple-state simulation, as described by Russ Cox [3]. Same as RE2, golang's regex, and rust-regex, albeit they have some optimizations to avoid falling back to the slow NFA engine.

Can a non-backtracking NFA based engine consistently (as in for every regex) reach PCRE performance? ¯\_(ツ)_/¯. I've never seen it happen. Thompson's algorithms don't come close, I can tell you that much.

I plan to improve the design and add some RE2 optimizations (the on-the-fly bounded-DFA construction at least) to make nim-regex faster than it's now.

I read a few times that D has/had the fastest regex engine (https://dlang.org/library/std/regex.html), but I can't find a recent benchmark on this:
https://forum.dlang.org/post/[email protected]

Nim has the fastest regex engine that I know of [4], at least for expressions without capture groups :p, which surprisingly covers a lot of cases: regex validation, search/find boundaries, search&replace, etc. But DFAs have their own drawbacks (i.e: exponential construction/compilation blow-up in rare cases).

If it is one of the fastest, may be good to look into its design given some similarities that D and nim have (in particular, both being able to do CTFE, which is relevant for this discussion, since std.regex uses that for bytecode (relevant even if we're in the end talking about running regex at RT))

here's his (dated) blog post: https://dlang.org/blog/2016/11/07/big-performance-improvement-for-std-regex/ which has interesting insights

That's a good read, but Bit-Parallel NFA Simulation (Glushkov's NFA simulation) is not something new. While it's faster than Thompson's simulation, it's not faster than a backtracking JITted simulation. Also, It's not a general solution (i.e: it only supports NFAs smaller than WORD-size), I'd guess D uses it to find prefixes (i.e: it only uses the first WORD-size parallel states of the NFA).

The state of NFA based engines boils down to a bunch of speed-ups on top of a slow algorithm. NFA is a dead end, performance wise. D's regex, same as RE2 and rust-regex, are based on an NFA engine. They put some other engines on top that work in some cases depending on the regular expression, but they fallback to the dog slow NFA engine otherwise.

alternative approach: better PCRE wrapper

There are too many horror stories of using backtracking regex engines (such as PCRE) in production [0] [1] [2], though, and they are not necessarily fast, just fast in some cases. For example, the regexes in mariomka/regex-benchmark will have limited backtracking, and depending on the input, no backtracking at all, so PCRE looks good in those benchmarks.

timotheecour · 2020-02-29T07:51:55Z

if the main problem for PCRE is good average performance but exponential worst case performance, can't a hybrid algorithm be used (at the expense of code complexity, granted)? That strategy worked out well in a different context here nim-lang/Nim#13440, but it's also a common strategy, eg introsort uses same trick https://en.wikipedia.org/wiki/Introsort

begins with quicksort, it switches to heapsort when the recursion depth exceeds a level based on (the logarithm of) the number of elements being sorted and it switches to insertion sort when the number of elements is below some threshold. This combines the good parts of the three algorithms [...]

nitely · 2020-02-29T08:38:25Z

Sure, but PCRE is fast because it has a JIT compiler, and a lot of work went into optimizing both the compiler and the backtracking algo. idk whether it's feasible to replicate the work.

It'd be interesting to know why PCRE doesn't use some heuristic to switch to a linear time algorithm in the exponential time cases.

nitely · 2020-02-29T09:14:55Z

I think the NFA/DFA hybrid approach (what RE2 and rust-regex do) would yield much better results, and make nim-regex fast enough. It's also easier to implement, so we'll see.

timotheecour · 2020-02-29T09:55:58Z

idk whether it's feasible to replicate the work

i meant wrap, not replicate here:

have a fast regex engine at runtime via reusing PCRE; with a strategy to switch to another algorithm (eg the one in nim-regex for eg) as fallback depth search is too high
expose it via a good API (maybe simply nim-regex's current API)
for running all this at CT (which would use importc pcre), there are 2 options:
- -d:nimHasLibFFI
- using vmops
- or giving up on PCRE-level performance and using current nim-regex, which is fine for CT work
for js target I guess this wouldn't work, but fallback to current nim-regex is good enough

nitely mentioned this issue Mar 24, 2020

NFA macro + better NFA matcher #58

Merged

nitely closed this as completed in #58 Apr 4, 2020

timotheecour mentioned this issue Apr 4, 2020

misc tracking of unresolved issues/comments timotheecour/Nim#87

Open

timotheecour mentioned this issue May 4, 2021

re/nre: mention nim-regex as the alternative nim-lang/Nim#17926

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

performance comparison vs std/re, std/nre #57

performance comparison vs std/re, std/nre #57

timotheecour commented Feb 29, 2020 •

edited

Loading

nitely commented Feb 29, 2020

timotheecour commented Feb 29, 2020

nitely commented Feb 29, 2020 •

edited

Loading

timotheecour commented Feb 29, 2020

nitely commented Feb 29, 2020

nitely commented Feb 29, 2020 •

edited

Loading

timotheecour commented Feb 29, 2020 •

edited

Loading

performance comparison vs std/re, std/nre #57

performance comparison vs std/re, std/nre #57

Comments

timotheecour commented Feb 29, 2020 • edited Loading

links

nitely commented Feb 29, 2020

timotheecour commented Feb 29, 2020

D's std.regex

alternative approach: better PCRE wrapper

nitely commented Feb 29, 2020 • edited Loading

timotheecour commented Feb 29, 2020

nitely commented Feb 29, 2020

nitely commented Feb 29, 2020 • edited Loading

timotheecour commented Feb 29, 2020 • edited Loading

timotheecour commented Feb 29, 2020 •

edited

Loading

nitely commented Feb 29, 2020 •

edited

Loading

nitely commented Feb 29, 2020 •

edited

Loading

timotheecour commented Feb 29, 2020 •

edited

Loading