
Search is extremely slow when subject string is repeating the start of the search pattern #101

Closed
tempelmann opened this issue Apr 17, 2022 · 4 comments


@tempelmann

tempelmann commented Apr 17, 2022

I am using PCRE to find UTF-16 BE text.

For example, to find "ABC", I'd use this pattern: \x00A\x00B\x00C

This works pretty well with PCRE in general. But with some target files, e.g. a huge file (4 GB) consisting only of 00s, it takes about 20 s on an Intel i7 @ 5 GHz, and even 90 s on Apple Silicon (M1 Air). As soon as I remove the leading \x00, it's super fast.

So there are two issues:

  1. Why is the ARM version so much slower than the Intel version? Is this one of those cases where there's a special Intel CPU instruction that the ARM doesn't have? Or is there some other issue with the code, perhaps?
  2. Is this a general weakness of the RE algorithm implementation that it goes so slowly when asked to find something where the target string is a long repetition of the beginning of the pattern? I can imagine so, suspecting there's some stacking and backtracking happening, but I wonder whether it has to result in this ugly effect. I'm not asking for special treatment, just asking in general whether there might be an oversight in the code. If not, just close as "works as designed".
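One way to picture the effect (an illustration only, not a description of PCRE2's internals): when a matcher can skip ahead to the next occurrence of the pattern's first literal byte, the number of candidate start positions dominates the cost. In a zero-filled subject, every single byte is a candidate for a pattern starting with \x00, but none are for a pattern starting with A:

```python
# Illustration: count candidate starting positions for the first
# pattern byte in a zero-filled subject (not PCRE2 internals).
subject = b"\x00" * 1_000_000  # stand-in for a 00-filled file

# Pattern \x00A\x00B\x00C: first byte is 0x00, so every position
# must be tried (each attempt fails only on the second byte).
candidates_with_nul = subject.count(b"\x00")

# Pattern A\x00B\x00C: first byte is 'A', so a fast first-byte scan
# (e.g. memchr-style) can reject the whole subject cheaply.
candidates_without_nul = subject.count(b"A")

print(candidates_with_nul, candidates_without_nul)  # 1000000 0
```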

I'm providing a sample file here: http://files.tempel.org/Various/zip64sample.zip.gz

I am using the 8-bit version of the calls and use the following match options:

PCRE2_PARTIAL_HARD | PCRE2_NOTBOL | PCRE2_NOTEOL | PCRE2_NOTEMPTY | PCRE2_NO_UTF_CHECK

BTW, I was able to work around this issue by removing the leading NUL char and implementing a pcre2 callout function that gets called at the end of the match, where it checks whether the byte before the start of the match is zero (I also handle the special case where the match is at the subject's start). It now runs much faster with the occasional 00-filled file, with the M1 finally also being much faster than the Intel Mac.
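The workaround above is implemented in C against the PCRE2 callout API; the check it performs can be sketched in plain Python (hypothetical `find_utf16be` helper, using the standard `re` module in place of PCRE2):

```python
import re

def find_utf16be(subject: bytes, text: str):
    """Find UTF-16 BE occurrences of `text` by searching without the
    leading NUL, then verifying that the byte before the match is 0x00
    (or that the match is at the very start of the subject)."""
    # Pattern for "ABC" without the leading \x00: A\x00B\x00C
    tail = text[0].encode("ascii") + text[1:].encode("utf-16-be")
    for m in re.finditer(re.escape(tail), subject):
        start = m.start()
        if start == 0 or subject[start - 1] == 0:
            # The real match starts one byte earlier, at the NUL
            # (clamped to 0 when the subject begins mid-character).
            yield max(start - 1, 0)

data = b"\x00" * 10 + "ABC".encode("utf-16-be") + b"\x00" * 10
print(list(find_utf16be(data, "ABC")))  # [10]
```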

@tempelmann tempelmann changed the title Search is extremely slow when target is repeating the start of the search pattern Search is extremely slow when subject string is repeating the start of the search pattern Apr 17, 2022
@PhilipHazel
Collaborator

PhilipHazel commented Apr 18, 2022 via email

@tempelmann
Author

The reason for doing it this way:

I'm trying to find text in any kind of file, where the text may appear in either UTF-8 or UTF-16 format. Instead of scanning the files twice, I create a regex pattern that combines all the possible encodings of the text to be found, including even decomposed vs. precomposed chars.

And since I don't know whether the target uses UTF-16 in LE or BE format, I try to handle both those cases as well.

The example given is just a subset of the pattern I generate, for the purpose of clearer demonstration of the performance issue I noticed.

And while the issue would probably also appear with a file filled with "A"s and a pattern looking for "ABC", such files are less likely to occur than a file with lots of 00s in it (in my case: a sector image file from an SSD, where this is quite common).

I also found that the issue appears both with JIT and without; that's why I didn't point it out. We had a previous conversation about a similar issue where an optimization was only applied in one case, so I first checked whether that applies here, and found that this is probably a different issue.

@tempelmann
Author

Seems to work "as designed"

@zherczeg
Collaborator

In general, more complex patterns are harder to optimize, so they tend to be slower than splitting them into multiple simpler patterns. To support the latter, PCRE2 supports match ranges. For example, if you have 3 patterns, you can split your input into 10K blocks and match the 3 patterns against each block. If multiple patterns match, choose the one with the smallest starting offset. That can be much faster than combining them into one pattern. Another option is to use multiple cores and run the 3 matches on different cores.
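The block-splitting idea can be sketched as follows (hypothetical `earliest_match` helper and block size; blocks overlap by the longest pattern length minus one so a match spanning a block boundary isn't missed):

```python
import re

def earliest_match(subject: bytes, patterns, block=10_000):
    """Split `subject` into blocks, run each (simple) pattern against
    each block, and return the smallest matching offset, or -1."""
    overlap = max(len(p) for p in patterns) - 1
    compiled = [re.compile(re.escape(p)) for p in patterns]
    pos = 0
    while pos < len(subject):
        # Extend each block by `overlap` bytes so matches that
        # straddle a block boundary are still found.
        chunk = subject[pos:pos + block + overlap]
        starts = [m.start() + pos
                  for rx in compiled
                  if (m := rx.search(chunk))]
        if starts:
            return min(starts)  # earliest match among all patterns
        pos += block
    return -1

data = b"\x00" * 25_000 + b"needle" + b"\x00" * 5_000
print(earliest_match(data, [b"needle", b"pin"]))  # 25000
```

The same loop parallelizes naturally: since blocks are independent up to the overlap, the per-block matches can be handed to different cores, as suggested above.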
