Search is extremely slow when subject string is repeating the start of the search pattern #101
Is there a reason why you are using the 8-bit PCRE2 library to match 16-bit
characters? The 16-bit library would surely be faster, but I can see that
maybe you want to process BE characters on LE hardware.
You do not say whether or not you are using JIT. If you are, Zoltan may
like to comment. If you are using the interpreter, the following comments
apply:
When your pattern is just "ABC" PCRE2 knows that the first byte must be
"A". This triggers a start-up optimization. It does a pre-match search
along the string for this byte, and quickly fails to find it. When your
pattern starts \x00A the first byte of a match must be \x00. The pre-match
search keeps finding this byte, and so the the main matching code keeps
getting started (and of course fails on the rest of the pattern). So yes,
it is going to be slow when the target string contains many instances of a
fixed first code unit, but no actual pattern matches.
I have no idea why ARM should be different to Intel because I'm ignorant
about the hardware differences. The pre-match optimization search in the
8-bit library uses the memchr() function which presumably is coded
optimally for the underlying hardware. In the "ABC" case there will only be
one (failing) call to memchr(), which is why this is super fast.
A final comment: I realize this might just be a test case, but if you are
really searching for fixed strings (that is, without any repetition or
alternation or other RE features), there are faster algorithms that will
easily do better than using a regex pattern match.
…On Sun, 17 Apr 2022 at 19:18, Thomas Tempelmann ***@***.***> wrote:
I am using PCRE to find UTF-16 BE text.
For example, to find "ABC", I'd use this pattern: \x00A\x00B\x00C
This works pretty well with PCRE, in general. But with some target files,
e.g. a huge file (4 GB) consisting only of 00s, it takes about 20s on an Intel
i7 @ 5 GHz, and even 90s on an Apple Silicon Mac (M1 Air). As soon as I remove
the leading \x00, it's super fast.
So there are two issues:
1. Why is the ARM version so much slower than the Intel version? Is
this one of those cases where there's a special Intel CPU instruction that
the ARM doesn't have? Or is there some other issue with the code, perhaps?
2. Is this a general weakness of the RE algorithm implementation that
it's so slow when asked to find something where the target string is a
long repetition of the beginning of the pattern? I can imagine so,
suspecting there's some stacking and backtracking happening, but I just
wonder whether it needs to result in this ugly effect. I'm not asking for
special treatment, just asking in general whether there might be an
oversight in the code. If not, just close this as "works as designed".
The reason for doing it this way: I'm trying to find text in any kind of file, where the text may appear in either UTF-8 or UTF-16 format. Instead of scanning the files twice, I create a regex pattern that combines all the possible writings of the text to be found, even including decomposed vs. precomposed chars. And since I don't know whether the target uses UTF-16 in LE or BE format, I try to handle both those cases as well. The example given is just a subset of the pattern I generate, for the purpose of demonstrating the performance issue more clearly. And while the issue would probably also appear when I have a file filled with "A"s and my pattern looks for "ABC", such files are less likely to occur than a file with lots of 00s in it (in my case: a sector image file from an SSD, where this is quite common). I also found that the issue appears both with JIT and without, which is why I didn't point it out. We had a previous conversation about a similar issue where an optimization would only be done in one case, so I first checked whether that applies here, and found that this is probably a different issue.
Seems to work "as designed".
In general, more complex patterns are harder to optimize, so they tend to be slower than splitting them into multiple simpler patterns. To support the latter, PCRE2 supports match ranges. For example, if you have 3 patterns, you can split your input into 10K blocks and match those 3 patterns against each block. If multiple patterns match, choose the one with the smallest starting point. That can be much faster than combining them into one pattern. Another option is using multiple cores, and running the 3 matches on different cores.
I am using PCRE to find UTF-16 BE text.
For example, to find "ABC", I'd use this pattern:
\x00A\x00B\x00C
This works pretty well with PCRE, in general. But with some target files, e.g. a huge file (4 GB) consisting only of 00s, it takes about 20s on an Intel i7 @ 5 GHz, and even 90s on an Apple Silicon Mac (M1 Air). As soon as I remove the leading \x00, it's super fast.
So there are two issues:
I'm providing a sample file here: http://files.tempel.org/Various/zip64sample.zip.gz
I am using the 8-bit version of the calls, and I use the following match options:
BTW, I was able to work around this issue by removing the leading NUL char and implementing a pcre2 callout function that gets called at the end of the match, where it then checks whether the byte before the start of the match is zero (I also handle the special case where the match is at the subject's start). This now runs much faster with the occasional 00-filled file, with the M1 finally also being much faster than the Intel Mac.