Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Validate the size/complexity of regular expressions #4061

Closed
andygrove opened this issue Nov 9, 2021 · 9 comments
Closed

[FEA] Validate the size/complexity of regular expressions #4061

andygrove opened this issue Nov 9, 2021 · 9 comments
Labels
feature request New feature or request P1 Nice to have for release reliability Features to improve reliability or bugs that severly impact the reliability of the plugin

Comments

@andygrove
Copy link
Contributor

Is your feature request related to a problem? Please describe.
Once #4044 is merged, we have the ability to parse regular expressions. We could analyze the parsed expression and estimate the complexity/cost of GPU kernels and fall back to CPU if the GPU cost is high.

Describe the solution you'd like
See #4044 (comment) for more details.

Describe alternatives you've considered
None

Additional context
None

@andygrove andygrove added feature request New feature or request ? - Needs Triage Need team to review and classify labels Nov 9, 2021
@Salonijain27 Salonijain27 added P1 Nice to have for release and removed ? - Needs Triage Need team to review and classify labels Nov 9, 2021
@jlowe
Copy link
Member

jlowe commented Nov 9, 2021

Note that if we end up needing to increase the default reserve memory to better handle complex regex kernels being launched with large thread stack space requirements, this may impact the use-cases that prompted lowering the default reserve recently. cc: @rongou

@andygrove andygrove added this to the Jan 10 - Jan 28 milestone Jan 7, 2022
@andygrove andygrove self-assigned this Jan 7, 2022
@andygrove
Copy link
Contributor Author

andygrove commented Jan 12, 2022

The main source of OOM errors for regex kernels comes from a stack allocation of 11 bytes per instruction per string at kernel launch. “Small” for 1-10 instructions, “medium” for 11-100 instructions and “large” for 101-1000 instructions.

We need to clarify exactly what "instructions" means and how it maps to the RegexAST that we have in the RAPIDS accelerator, but here is one example of GPU regexp code showing some instructions.

Flags = 0x00000000
Instructions:
  0:    CHAR c='3', next=2
  1:      OR right=0, left=2, next=2
  2:    RBRA id=1, next=4
  3:    LBRA id=1, next=1
  4:      OR right=3, left=5, next=5
  5:     END
startinst_id=3
startinst_ids: [ 3 -1]

Classes 0
Number of capturing groups: 1

Also, see cuDF cpp/src/strings/regex/regcomp.h:

enum InstType {
  CHAR    = 0177,  // Literal character
  RBRA    = 0201,  // Right bracket, )
  LBRA    = 0202,  // Left bracket, (
  OR      = 0204,  // Alternation, |
  ANY     = 0300,  // Any character except newline, .
  ANYNL   = 0301,  // Any character including newline, .
  BOL     = 0303,  // Beginning of line, ^
  EOL     = 0304,  // End of line, $
  CCLASS  = 0305,  // Character class, []
  NCCLASS = 0306,  // Negated character class, []
  BOW     = 0307,  // Boundary of word, /b
  NBOW    = 0310,  // Not boundary of word, /b
  END     = 0377   // Terminate: match found
};

@andygrove andygrove removed their assignment Jan 20, 2022
@sameerz sameerz removed this from the Jan 31 - Feb 11 milestone Feb 15, 2022
@andygrove andygrove self-assigned this Mar 1, 2022
@andygrove andygrove added this to the Feb 28 - Mar 18 milestone Mar 1, 2022
@andygrove
Copy link
Contributor Author

@revans2 @jlowe Would this feature still be required if cuDF global-memory use can be made to perform near the speed of stack memory? The memory would come from RMM instead of the CUDA stack in this case.

@jlowe
Copy link
Member

jlowe commented Mar 17, 2022

No, I don't think it would be necessary. If the regex kernels don't have an appreciably larger footprint than any other random libcudf kernel then I don't see the need to call it out explicitly.

@revans2
Copy link
Collaborator

revans2 commented Mar 17, 2022

No, I don't think it would be necessary. If the regex kernels don't have an appreciably larger footprint than any other random libcudf kernel then I don't see the need to call it out explicitly.

Exactly. This is a work around to issues with memory problems when running regular expressions.

@andygrove andygrove removed this from the Feb 28 - Mar 18 milestone Mar 17, 2022
@andygrove andygrove removed their assignment Mar 24, 2022
@revans2 revans2 mentioned this issue Apr 8, 2022
14 tasks
@revans2 revans2 added the reliability Features to improve reliability or bugs that severly impact the reliability of the plugin label Apr 12, 2022
@NVnavkumar
Copy link
Collaborator

NVnavkumar commented Jun 21, 2022

This PR rapidsai/cudf#10808 is probably useful for this issue. This PR has been closed due to the device requirement of the API in question.

@NVnavkumar
Copy link
Collaborator

Also, see this issue for the more desired functionality from cuDF: rapidsai/cudf#10852

@anthony-chang anthony-chang self-assigned this Jul 8, 2022
@NVnavkumar
Copy link
Collaborator

NVnavkumar commented Jul 8, 2022

Here's a possible way to compute a first pass of this without a device:

  • First, note that regular expressions are evaluated using a state machine.
  • The RegexAST class contains individual instances that could be evaluated as a state in the state machine, so
  • One can use the existing parser/transpiler code to calculate how many states will potentially be needed to be evaluated in the final regular expression that will be sent to cuDF.
  • Here's a rough example (there may be more states needed, but will need to research more on this topic)
    • A$ will be transpiled to A(?:[\n\r\u0085\u2028\u2029]|\r\n)?$
    • evaluating the A char - 1 state
    • evaluating the group - no capture, so no state needed
    • first character class [\n\r\u0085\u2028\u2029] - 2 states
    • choice - 3 states
    • second character class \r\n - 4 states
    • end of string anchor $ - 5 states
  • so each evaluation of this regexp needs 5 states, which can then be multiplied by the number of rows in the batch sent to the GPU. numStates * numRows = complexity
  • if complexity > maxComplexity, then we will fallback to the CPU
  • maxComplexity can be user configurable, something like spark.rapids.sql.regexp.maxComplexity (and this can default to an arbitrarily high value).
  • Keep in mind also that this can be evaluated recursively in the transpiler code, and the exception can be thrown in that code to fallback to the CPU.

@andygrove
Copy link
Contributor Author

Fixed in #6006

@sameerz sameerz changed the title [FEA] Validate the size/complexity of regular expressions [FEA] Validate the size/complexity of regular expressions Oct 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request P1 Nice to have for release reliability Features to improve reliability or bugs that severly impact the reliability of the plugin
Projects
None yet
Development

No branches or pull requests

7 participants