Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

performance comparison vs std/re, std/nre #57

Closed
timotheecour opened this issue Feb 29, 2020 · 7 comments · Fixed by #58
Closed

performance comparison vs std/re, std/nre #57

timotheecour opened this issue Feb 29, 2020 · 7 comments · Fixed by #58

Comments

@timotheecour
Copy link
Contributor

timotheecour commented Feb 29, 2020

It would be really nice to have a benchmark comparing performance wrt std/re, std/nre; I haven't seen any.
This would also be useful for performance regression testing.
This could start small and become more comprehensive as needed.

links

https://forum.nim-lang.org/t/6008#37206

@nitely
Copy link
Owner

nitely commented Feb 29, 2020

IIRC it's 40 times slower. Someone reported it to be 5 times slower. It really depends on the regex, PCRE has exponential time complexity, and while those cases are kind of rare, it's easy to build a regex that runs in quadratic time (and/or does a lot of backtracking), so nim-regex is likely faster in such cases.

I'm working on making it faster (as nregex when there are capture groups, i.e: 4 times slower than PCRE), but it'll take me a little while to get it done.

@timotheecour
Copy link
Contributor Author

oh yikes; maybe 1st step to improve this would be adding some benchmark code somewhere and keep track of performance... if you have already one maybe link it here?

I suggest using this: https://github.com/mariomka/regex-benchmark

any ideas where the bottlnecks, and whether this is an implementation issue or if the design itself is the limiting factor compared to PCRE?

D's std.regex

I read a few times that D has/had the fastest regex engine (https://dlang.org/library/std/regex.html), but I can't find a recent benchmark on this:
https://forum.dlang.org/post/[email protected]

It's kinda sad because std.regex used to be touted as the fastest regex engine even compared to other regex libraries. But it seems others have caught up since while we've stagnated. :-/

it seems a bit in contradiction with numbers from https://github.com/mariomka/regex-benchmark ; /cc @DmitryOlshansky do you have any insights / links?

If it is one of the fastest, may be good to look into its design given some similarities that D and nim have (in particular, both being able to do CTFE, which is relevant for this discussion, since std.regex uses that for bytecode (relevant even if we're in the end talking about running regex at RT))

here's his (dated) blog post: https://dlang.org/blog/2016/11/07/big-performance-improvement-for-std-regex/ which has interesting insights

alternative approach: better PCRE wrapper

I'm wondering whether we can as an alternative wrap PCRE but using a better API than nim's std/re, std/nre, which have a number of issues.
as for running regex at CT, I tried re and nre with -d:nimHasLibFFI; it "almost works" but needs small (maybe) language change to relax when casting in VM is allowed; so we could have high performance using both RT and CT

@nitely
Copy link
Owner

nitely commented Feb 29, 2020

any ideas where the bottlnecks, and whether this is an implementation issue or if the design itself is the limiting factor compared to PCRE?

It's a design issue, AFAIK. nim-regex is based on Thompson's construction and multiple-state simulation, as described by Russ Cox [3]. Same as RE2, golang's regex, and rust-regex, albeit they have some optimizations to avoid falling back to the slow NFA engine.

Can a non-backtracking NFA based engine consistently (as in for every regex) reach PCRE performance? ¯\_(ツ)_/¯. I've never seen it happen. Thompson's algorithms don't come close, I can tell you that much.

I plan to improve the design and add some RE2 optimizations (the on-the-fly bounded-DFA construction at least) to make nim-regex faster than it's now.

I read a few times that D has/had the fastest regex engine (https://dlang.org/library/std/regex.html), but I can't find a recent benchmark on this:
https://forum.dlang.org/post/[email protected]

Nim has the fastest regex engine that I know of [4], at least for expressions without capture groups :p, which surprisingly covers a lot of cases: regex validation, search/find boundaries, search&replace, etc. But DFAs have their own drawbacks (i.e: exponential construction/compilation blow-up in rare cases).

If it is one of the fastest, may be good to look into its design given some similarities that D and nim have (in particular, both being able to do CTFE, which is relevant for this discussion, since std.regex uses that for bytecode (relevant even if we're in the end talking about running regex at RT))

here's his (dated) blog post: https://dlang.org/blog/2016/11/07/big-performance-improvement-for-std-regex/ which has interesting insights

That's a good read, but Bit-Parallel NFA Simulation (Glushkov's NFA simulation) is not something new. While it's faster than Thompson's simulation, it's not faster than a backtracking JITted simulation. Also, It's not a general solution (i.e: it only supports NFAs smaller than WORD-size), I'd guess D uses it to find prefixes (i.e: it only uses the first WORD-size parallel states of the NFA).

The state of NFA based engines boils down to a bunch of speed-ups on top of a slow algorithm. NFA is a dead end, performance wise. D's regex, same as RE2 and rust-regex, are based on an NFA engine. They put some other engines on top that work in some cases depending on the regular expression, but they fallback to the dog slow NFA engine otherwise.

alternative approach: better PCRE wrapper

There are too many horror stories of using backtracking regex engines (such as PCRE) in production [0] [1] [2], though, and they are not necessarily fast, just fast in some cases. For example, the regexes in mariomka/regex-benchmark will have limited backtracking, and depending on the input, no backtracking at all, so PCRE looks good in those benchmarks.

@timotheecour
Copy link
Contributor Author

if the main problem for PCRE is good average performance but exponential worst case performance, can't a hybrid algorithm be used (at the expense of code complexity, granted)? That strategy worked out well in a different context here nim-lang/Nim#13440, but it's also a common strategy, eg introsort uses same trick https://en.wikipedia.org/wiki/Introsort

begins with quicksort, it switches to heapsort when the recursion depth exceeds a level based on (the logarithm of) the number of elements being sorted and it switches to insertion sort when the number of elements is below some threshold. This combines the good parts of the three algorithms [...]

@nitely
Copy link
Owner

nitely commented Feb 29, 2020

Sure, but PCRE is fast because it has a JIT compiler, and a lot of work went into optimizing both the compiler and the backtracking algo. idk whether it's feasible to replicate the work.

It'd be interesting to know why PCRE doesn't use some heuristic to switch to a linear time algorithm in the exponential time cases.

@nitely
Copy link
Owner

nitely commented Feb 29, 2020

I think the NFA/DFA hybrid approach (what RE2 and rust-regex do) would yield much better results, and make nim-regex fast enough. It's also easier to implement, so we'll see.

@timotheecour
Copy link
Contributor Author

timotheecour commented Feb 29, 2020

idk whether it's feasible to replicate the work

i meant wrap, not replicate here:

  • have a fast regex engine at runtime via reusing PCRE; with a strategy to switch to another algorithm (eg the one in nim-regex for eg) as fallback depth search is too high
  • expose it via a good API (maybe simply nim-regex's current API)
  • for running all this at CT (which would use importc pcre), there are 2 options:
    • -d:nimHasLibFFI
    • using vmops
    • or giving up on PCRE-level performance and using current nim-regex, which is fine for CT work
  • for js target I guess this wouldn't work, but fallback to current nim-regex is good enough

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants