
Search is extremely slow when subject string is repeating the start of the search pattern #101

Closed
tempelmann opened this issue Apr 17, 2022 · 4 comments


@tempelmann

tempelmann commented Apr 17, 2022

I am using PCRE to find UTF-16 BE text.

For example, to find "ABC", I'd use this pattern: \x00A\x00B\x00C

This works pretty well with PCRE in general. But with some target files, e.g. a huge file (4 GB) consisting only of 00s, it takes about 20 s on an Intel i7 @ 5 GHz, and even 90 s on Apple Silicon (M1 Air). As soon as I remove the leading \x00, it's super fast.

So there are two issues:

  1. Why is the ARM version so much slower than the Intel version? Is this one of those cases where there's a special Intel CPU instruction that the ARM doesn't have? Or is there some other issue with the code, perhaps?
  2. Is this a general weakness of the RE algorithm implementation that it goes so slowly when asked to find something where the target string is a long repetition of the beginning of the pattern? I can imagine so, suspecting there's some stacking and backtracking happening, but I wonder whether it has to result in this ugly effect. I'm not asking for special treatment, just asking in general whether there might be an oversight in the code. If not, just close as "works as designed".
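One way to picture the effect (an illustration only, not a description of PCRE2's internals): when a matcher can skip ahead to the next occurrence of the pattern's first literal byte, the number of candidate start positions dominates the cost. In a zero-filled subject, every single byte is a candidate for a pattern starting with \x00, but none are for a pattern starting with A:

```python
# Illustration: count candidate starting positions for the first
# pattern byte in a zero-filled subject (not PCRE2 internals).
subject = b"\x00" * 1_000_000  # stand-in for a 00-filled file

# Pattern \x00A\x00B\x00C: first byte is 0x00, so every position
# must be tried (each attempt fails only on the second byte).
candidates_with_nul = subject.count(b"\x00")

# Pattern A\x00B\x00C: first byte is 'A', so a fast first-byte scan
# (e.g. memchr-style) can reject the whole subject cheaply.
candidates_without_nul = subject.count(b"A")

print(candidates_with_nul, candidates_without_nul)  # 1000000 0
```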

I'm providing a sample file here: http://files.tempel.org/Various/zip64sample.zip.gz

I am using the 8-bit version of the calls and use the following match options:

PCRE2_PARTIAL_HARD | PCRE2_NOTBOL | PCRE2_NOTEOL | PCRE2_NOTEMPTY | PCRE2_NO_UTF_CHECK

BTW, I was able to work around this issue by removing the leading NUL char and implementing a pcre2 callout function that gets called at the end of the match, where it checks whether the byte before the start of the match is zero (I also handle the special case where the match is at the subject's start). It now runs much faster with the occasional 00-filled file, with the M1 finally also being much faster than the Intel Mac.
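The workaround above is implemented in C against the PCRE2 callout API; the check it performs can be sketched in plain Python (hypothetical `find_utf16be` helper, using the standard `re` module in place of PCRE2):

```python
import re

def find_utf16be(subject: bytes, text: str):
    """Find UTF-16 BE occurrences of `text` by searching without the
    leading NUL, then verifying that the byte before the match is 0x00
    (or that the match is at the very start of the subject)."""
    # Pattern for "ABC" without the leading \x00: A\x00B\x00C
    tail = text[0].encode("ascii") + text[1:].encode("utf-16-be")
    for m in re.finditer(re.escape(tail), subject):
        start = m.start()
        if start == 0 or subject[start - 1] == 0:
            # The real match starts one byte earlier, at the NUL
            # (clamped to 0 when the subject begins mid-character).
            yield max(start - 1, 0)

data = b"\x00" * 10 + "ABC".encode("utf-16-be") + b"\x00" * 10
print(list(find_utf16be(data, "ABC")))  # [10]
```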

@tempelmann tempelmann changed the title Search is extremely slow when target is repeating the start of the search pattern Search is extremely slow when subject string is repeating the start of the search pattern Apr 17, 2022
@PhilipHazel
Collaborator

PhilipHazel commented Apr 18, 2022 via email

@tempelmann
Author

The reason for doing it this way:

I'm trying to find text in any kind of file, where the text may appear in either UTF-8 or UTF-16 format. Instead of scanning the files twice, I create a regex pattern that combines all the possible encodings of the text to be found, including even decomposed vs. precomposed chars.

And since I don't know whether the target uses UTF-16 in LE or BE format, I try to handle both those cases as well.

The example given is just a subset of the pattern I generate, for the purpose of clearer demonstration of the performance issue I noticed.

And while the issue would probably also appear with a file filled with "A"s and a pattern looking for "ABC", such files are less likely to occur than a file with lots of 00s in it (in my case: a sector image file from an SSD, where this is quite common).

I also found that the issue appears both with JIT and without; that's why I didn't point it out. We had a previous conversation about a similar issue where an optimization was only applied in one case, so I first checked whether that applies here, and found that this is probably a different issue.

@tempelmann
Author

Seems to work "as designed"

@zherczeg
Collaborator

In general, more complex patterns are harder to optimize, so they tend to be slower than splitting them into multiple simpler patterns. To support the latter, PCRE2 supports match ranges. For example, if you have 3 patterns, you can split your input into 10K blocks and match the 3 patterns against each block. If multiple patterns match, choose the one with the smallest starting offset. That can be much faster than combining them into one pattern. Another option is to use multiple cores and run the 3 matches on different cores.
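The block-splitting idea can be sketched as follows (hypothetical `earliest_match` helper and block size; blocks overlap by the longest pattern length minus one so a match spanning a block boundary isn't missed):

```python
import re

def earliest_match(subject: bytes, patterns, block=10_000):
    """Split `subject` into blocks, run each (simple) pattern against
    each block, and return the smallest matching offset, or -1."""
    overlap = max(len(p) for p in patterns) - 1
    compiled = [re.compile(re.escape(p)) for p in patterns]
    pos = 0
    while pos < len(subject):
        # Extend each block by `overlap` bytes so matches that
        # straddle a block boundary are still found.
        chunk = subject[pos:pos + block + overlap]
        starts = [m.start() + pos
                  for rx in compiled
                  if (m := rx.search(chunk))]
        if starts:
            return min(starts)  # earliest match among all patterns
        pos += block
    return -1

data = b"\x00" * 25_000 + b"needle" + b"\x00" * 5_000
print(earliest_match(data, [b"needle", b"pin"]))  # 25000
```

The same loop parallelizes naturally: since blocks are independent up to the overlap, the per-block matches can be handed to different cores, as suggested above.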
