[FEA] Validate the size/complexity of regular expressions #4061

andygrove · 2021-11-09T16:56:32Z

Is your feature request related to a problem? Please describe.
Once #4044 is merged, we have the ability to parse regular expressions. We could analyze the parsed expression and estimate the complexity/cost of GPU kernels and fall back to CPU if the GPU cost is high.

Describe the solution you'd like
See #4044 (comment) for more details.

Describe alternatives you've considered
None

Additional context
None

jlowe · 2021-11-09T22:33:18Z

Note that if we end up needing to increase the default reserve memory to better handle complex regex kernels being launched with large thread stack space requirements, this may impact the use-cases that prompted lowering the default reserve recently. cc: @rongou

andygrove · 2022-01-12T18:52:51Z

The main source of OOM errors for regex kernels comes from a stack allocation of 11 bytes per instruction per string at kernel launch. “Small” for 1-10 instructions, “medium” for 11-100 instructions and “large” for 101-1000 instructions.

We need to clarify exactly what "instructions" means and how it maps to the RegexAST that we have in the RAPIDS accelerator, but here is one example of GPU regexp code showing some instructions.

Flags = 0x00000000
Instructions:
  0:    CHAR c='3', next=2
  1:      OR right=0, left=2, next=2
  2:    RBRA id=1, next=4
  3:    LBRA id=1, next=1
  4:      OR right=3, left=5, next=5
  5:     END
startinst_id=3
startinst_ids: [ 3 -1]

Classes 0
Number of capturing groups: 1

Also, see cuDF cpp/src/strings/regex/regcomp.h:

enum InstType {
  CHAR    = 0177,  // Literal character
  RBRA    = 0201,  // Right bracket, )
  LBRA    = 0202,  // Left bracket, (
  OR      = 0204,  // Alternation, |
  ANY     = 0300,  // Any character except newline, .
  ANYNL   = 0301,  // Any character including newline, .
  BOL     = 0303,  // Beginning of line, ^
  EOL     = 0304,  // End of line, $
  CCLASS  = 0305,  // Character class, []
  NCCLASS = 0306,  // Negated character class, []
  BOW     = 0307,  // Boundary of word, /b
  NBOW    = 0310,  // Not boundary of word, /b
  END     = 0377   // Terminate: match found
};

andygrove · 2022-03-17T16:10:17Z

@revans2 @jlowe Would this feature still be required if cuDF global-memory use can be made to perform near the speed of stack memory? The memory would come from RMM instead of the CUDA stack in this case.

jlowe · 2022-03-17T16:17:26Z

No, I don't think it would be necessary. If the regex kernels don't have an appreciably larger footprint than any other random libcudf kernel then I don't see the need to call it out explicitly.

revans2 · 2022-03-17T16:25:22Z

No, I don't think it would be necessary. If the regex kernels don't have an appreciably larger footprint than any other random libcudf kernel then I don't see the need to call it out explicitly.

Exactly. This is a work around to issues with memory problems when running regular expressions.

NVnavkumar · 2022-06-21T21:33:17Z

~~This PR rapidsai/cudf#10808 is probably useful for this issue~~. This PR has been closed due to the device requirement of the API in question.

NVnavkumar · 2022-06-21T23:21:18Z

Also, see this issue for the more desired functionality from cuDF: rapidsai/cudf#10852

NVnavkumar · 2022-07-08T21:17:46Z

Here's a possible way to compute a first pass of this without a device:

First, note that regular expressions are evaluated using a state machine.
The RegexAST class contains individual instances that could be evaluated as a state in the state machine, so
One can use the existing parser/transpiler code to calculate how many states will potentially be needed to be evaluated in the final regular expression that will be sent to cuDF.
Here's a rough example (there may be more states needed, but will need to research more on this topic)
- A$ will be transpiled to A(?:[\n\r\u0085\u2028\u2029]|\r\n)?$
- evaluating the A char - 1 state
- evaluating the group - no capture, so no state needed
- first character class [\n\r\u0085\u2028\u2029] - 2 states
- choice - 3 states
- second character class \r\n - 4 states
- end of string anchor $ - 5 states
so each evaluation of this regexp needs 5 states, which can then be multiplied by the number of rows in the batch sent to the GPU. numStates * numRows = complexity
if complexity > maxComplexity, then we will fallback to the CPU
maxComplexity can be user configurable, something like spark.rapids.sql.regexp.maxComplexity (and this can default to an arbitrarily high value).
Keep in mind also that this can be evaluated recursively in the transpiler code, and the exception can be thrown in that code to fallback to the CPU.

andygrove · 2022-08-19T17:44:31Z

Fixed in #6006

andygrove added feature request New feature or request ? - Needs Triage Need team to review and classify labels Nov 9, 2021

andygrove mentioned this issue Nov 9, 2021

RLike: Fall back to CPU for regex that would produce incorrect results #4044

Merged

Salonijain27 added P1 Nice to have for release and removed ? - Needs Triage Need team to review and classify labels Nov 9, 2021

andygrove added this to the Jan 10 - Jan 28 milestone Jan 7, 2022

andygrove self-assigned this Jan 7, 2022

andygrove mentioned this issue Jan 12, 2022

[FEA] Enable regular expressions by default #4509

Open

61 tasks

andygrove removed their assignment Jan 20, 2022

andygrove modified the milestones: Jan 10 - Jan 28, Jan 31 - Feb 11 Jan 31, 2022

sameerz removed this from the Jan 31 - Feb 11 milestone Feb 15, 2022

andygrove self-assigned this Mar 1, 2022

andygrove added this to the Feb 28 - Mar 18 milestone Mar 1, 2022

andygrove removed this from the Feb 28 - Mar 18 milestone Mar 17, 2022

andygrove removed their assignment Mar 24, 2022

revans2 mentioned this issue Apr 8, 2022

[TASK] Big Reliability Epic #1870

Closed

14 tasks

revans2 added the reliability Features to improve reliability or bugs that severly impact the reliability of the plugin label Apr 12, 2022

anthony-chang self-assigned this Jul 8, 2022

anthony-chang mentioned this issue Jul 15, 2022

Estimate and validate regular expression complexities #6006

Merged

anthony-chang removed their assignment Aug 17, 2022

andygrove closed this as completed Aug 19, 2022

sameerz changed the title ~~[FEA] Validate the size/complexity of regular expressions~~ [FEA] Validate the size/complexity of regular expressions Oct 11, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEA] Validate the size/complexity of regular expressions #4061

[FEA] Validate the size/complexity of regular expressions #4061

andygrove commented Nov 9, 2021

jlowe commented Nov 9, 2021

andygrove commented Jan 12, 2022 •

edited

Loading

andygrove commented Mar 17, 2022

jlowe commented Mar 17, 2022

revans2 commented Mar 17, 2022

NVnavkumar commented Jun 21, 2022 •

edited

Loading

NVnavkumar commented Jun 21, 2022

NVnavkumar commented Jul 8, 2022 •

edited

Loading

andygrove commented Aug 19, 2022

[FEA] Validate the size/complexity of regular expressions #4061

[FEA] Validate the size/complexity of regular expressions #4061

Comments

andygrove commented Nov 9, 2021

jlowe commented Nov 9, 2021

andygrove commented Jan 12, 2022 • edited Loading

andygrove commented Mar 17, 2022

jlowe commented Mar 17, 2022

revans2 commented Mar 17, 2022

NVnavkumar commented Jun 21, 2022 • edited Loading

NVnavkumar commented Jun 21, 2022

NVnavkumar commented Jul 8, 2022 • edited Loading

andygrove commented Aug 19, 2022

andygrove commented Jan 12, 2022 •

edited

Loading

NVnavkumar commented Jun 21, 2022 •

edited

Loading

NVnavkumar commented Jul 8, 2022 •

edited

Loading