-
-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
performance comparison vs std/re, std/nre #57
Comments
IIRC it's 40 times slower. Someone reported it to be 5 times slower. It really depends on the regex, PCRE has exponential time complexity, and while those cases are kind of rare, it's easy to build a regex that runs in quadratic time (and/or does a lot of backtracking), so nim-regex is likely faster in such cases. I'm working on making it faster (as nregex when there are capture groups, i.e: 4 times slower than PCRE), but it'll take me a little while to get it done. |
oh yikes; maybe 1st step to improve this would be adding some benchmark code somewhere and keep track of performance... if you have already one maybe link it here? I suggest using this: https://github.com/mariomka/regex-benchmark any ideas where the bottlnecks, and whether this is an implementation issue or if the design itself is the limiting factor compared to PCRE? D's std.regexI read a few times that D has/had the fastest regex engine (https://dlang.org/library/std/regex.html), but I can't find a recent benchmark on this:
it seems a bit in contradiction with numbers from https://github.com/mariomka/regex-benchmark ; /cc @DmitryOlshansky do you have any insights / links? If it is one of the fastest, may be good to look into its design given some similarities that D and nim have (in particular, both being able to do CTFE, which is relevant for this discussion, since std.regex uses that for bytecode (relevant even if we're in the end talking about running regex at RT)) here's his (dated) blog post: https://dlang.org/blog/2016/11/07/big-performance-improvement-for-std-regex/ which has interesting insights alternative approach: better PCRE wrapperI'm wondering whether we can as an alternative wrap PCRE but using a better API than nim's std/re, std/nre, which have a number of issues. |
It's a design issue, AFAIK. nim-regex is based on Thompson's construction and multiple-state simulation, as described by Russ Cox [3]. Same as RE2, golang's regex, and rust-regex, albeit they have some optimizations to avoid falling back to the slow NFA engine. Can a non-backtracking NFA based engine consistently (as in for every regex) reach PCRE performance? I plan to improve the design and add some RE2 optimizations (the on-the-fly bounded-DFA construction at least) to make nim-regex faster than it's now.
Nim has the fastest regex engine that I know of [4], at least for expressions without capture groups :p, which surprisingly covers a lot of cases: regex validation, search/find boundaries, search&replace, etc. But DFAs have their own drawbacks (i.e: exponential construction/compilation blow-up in rare cases).
That's a good read, but Bit-Parallel NFA Simulation (Glushkov's NFA simulation) is not something new. While it's faster than Thompson's simulation, it's not faster than a backtracking JITted simulation. Also, It's not a general solution (i.e: it only supports NFAs smaller than WORD-size), I'd guess D uses it to find prefixes (i.e: it only uses the first WORD-size parallel states of the NFA). The state of NFA based engines boils down to a bunch of speed-ups on top of a slow algorithm. NFA is a dead end, performance wise. D's regex, same as RE2 and rust-regex, are based on an NFA engine. They put some other engines on top that work in some cases depending on the regular expression, but they fallback to the dog slow NFA engine otherwise.
There are too many horror stories of using backtracking regex engines (such as PCRE) in production [0] [1] [2], though, and they are not necessarily fast, just fast in some cases. For example, the regexes in |
if the main problem for PCRE is good average performance but exponential worst case performance, can't a hybrid algorithm be used (at the expense of code complexity, granted)? That strategy worked out well in a different context here nim-lang/Nim#13440, but it's also a common strategy, eg introsort uses same trick https://en.wikipedia.org/wiki/Introsort
|
Sure, but PCRE is fast because it has a JIT compiler, and a lot of work went into optimizing both the compiler and the backtracking algo. idk whether it's feasible to replicate the work. It'd be interesting to know why PCRE doesn't use some heuristic to switch to a linear time algorithm in the exponential time cases. |
I think the NFA/DFA hybrid approach (what RE2 and rust-regex do) would yield much better results, and make nim-regex fast enough. It's also easier to implement, so we'll see. |
i meant wrap, not replicate here:
|
It would be really nice to have a benchmark comparing performance wrt std/re, std/nre; I haven't seen any.
This would also be useful for performance regression testing.
This could start small and become more comprehensive as needed.
links
https://forum.nim-lang.org/t/6008#37206
The text was updated successfully, but these errors were encountered: