[FEA] Validate the size/complexity of regular expressions #4061
Comments
Note that if we end up needing to increase the default reserve memory to better handle complex regex kernels being launched with large thread stack space requirements, this may impact the use-cases that prompted lowering the default reserve recently. cc: @rongou |
The main source of OOM errors for regex kernels is a stack allocation of 11 bytes per instruction per string at kernel launch. Expressions are sized as “small” (1-10 instructions), “medium” (11-100 instructions), or “large” (101-1000 instructions). We need to clarify exactly what "instructions" means and how it maps to the
Also, see cuDF cpp/src/strings/regex/regcomp.h:
enum InstType {
  CHAR    = 0177,  // Literal character
  RBRA    = 0201,  // Right bracket, )
  LBRA    = 0202,  // Left bracket, (
  OR      = 0204,  // Alternation, |
  ANY     = 0300,  // Any character except newline, .
  ANYNL   = 0301,  // Any character including newline, .
  BOL     = 0303,  // Beginning of line, ^
  EOL     = 0304,  // End of line, $
  CCLASS  = 0305,  // Character class, []
  NCCLASS = 0306,  // Negated character class, [^]
  BOW     = 0307,  // Boundary of word, \b
  NBOW    = 0310,  // Not boundary of word, \B
  END     = 0377   // Terminate: match found
}; |
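To make the 11-bytes figure concrete, here is a minimal sketch of the arithmetic, assuming the estimate is simply bytes × instructions × strings and that the size buckets are the ones quoted in this thread. The names `RegexSizeClass`, `classify`, and `estimate_stack_bytes` are hypothetical and are not part of cuDF, and the `TooLarge` bucket is an assumption for counts above 1000.

```cpp
#include <cstddef>
#include <cstdint>

// Figure quoted in this thread: 11 bytes of stack per instruction per string.
constexpr std::uint64_t kStackBytesPerInst = 11;

// Hypothetical buckets mirroring the small/medium/large ranges quoted above.
// TooLarge is an assumption for anything past 1000 instructions.
enum class RegexSizeClass { Small, Medium, Large, TooLarge };

// Map an instruction count to a bucket. A count of 0 falls into Small here;
// the thread only defines the buckets from 1 upward.
constexpr RegexSizeClass classify(std::size_t inst_count) {
  if (inst_count <= 10) return RegexSizeClass::Small;
  if (inst_count <= 100) return RegexSizeClass::Medium;
  if (inst_count <= 1000) return RegexSizeClass::Large;
  return RegexSizeClass::TooLarge;
}

// Rough stack footprint for one kernel launch over num_strings rows.
constexpr std::uint64_t estimate_stack_bytes(std::size_t inst_count,
                                             std::size_t num_strings) {
  return kStackBytesPerInst * static_cast<std::uint64_t>(inst_count) *
         static_cast<std::uint64_t>(num_strings);
}
```

Under these assumptions, a 100-instruction expression over one million strings comes to about 1.1 GB of stack space, which is the scale at which the reserve-memory concern raised above becomes relevant.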
No, I don't think it would be necessary. If the regex kernels don't have an appreciably larger footprint than any other random libcudf kernel then I don't see the need to call it out explicitly. |
Exactly. This is a workaround for memory problems when running regular expressions. |
Also, see this issue for the more desired functionality from cuDF: rapidsai/cudf#10852 |
Here's a possible way to compute a first pass of this without a device:
|
Fixed in #6006 |
Is your feature request related to a problem? Please describe.
Once #4044 is merged, we will have the ability to parse regular expressions. We could analyze the parsed expression, estimate the complexity/cost of the GPU kernels, and fall back to CPU if the GPU cost is high.
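As an illustration of the fallback idea, the sketch below compares a rough cost estimate against a memory budget. It assumes the cost is dominated by the 11 bytes per instruction per string mentioned in the comments, and that an instruction count is already available from the parsed expression. `Target`, `choose_target`, and `budget_bytes` are hypothetical names, not an existing plugin or cuDF API.

```cpp
#include <cstdint>

enum class Target { Gpu, Cpu };

// Assumption taken from the discussion: 11 bytes of stack per instruction per string.
constexpr std::uint64_t kAssumedBytesPerInstPerString = 11;

// Returns Cpu when the estimated stack footprint exceeds budget_bytes.
// The comparison is done by division so that the product
// per_string * num_strings is never formed and cannot overflow.
constexpr Target choose_target(std::uint64_t inst_count,
                               std::uint64_t num_strings,
                               std::uint64_t budget_bytes) {
  if (inst_count == 0 || num_strings == 0) return Target::Gpu;  // nothing to allocate
  const std::uint64_t per_string = kAssumedBytesPerInstPerString * inst_count;
  return per_string > budget_bytes / num_strings ? Target::Cpu : Target::Gpu;
}
```

For example, `choose_target(1000, 1000000, 1ULL << 30)` selects the CPU, because roughly 11 GB of estimated stack exceeds a 1 GiB budget. How the budget should relate to the reserve memory setting is left open here.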
Describe the solution you'd like
See #4044 (comment) for more details.
Describe alternatives you've considered
None
Additional context
None