[FEA] Support accepting compiled regular expression objects from re.compile() as a pattern #7904

Closed
Nicholas-7 opened this issue Apr 8, 2021 · 4 comments · Fixed by #9573
Assignees
Labels
feature request New feature or request good first issue Good for newcomers Python Affects Python cuDF API. strings strings issues (C++ and Python)

Comments

@Nicholas-7

I’d like cuDF to accept compiled regular expression objects from re.compile() as a pattern, the same way pandas does.

Example:

import re

cudf_regex_pat = re.compile(r"^.a|dog", flags=re.IGNORECASE)

cudfSeries4.str.replace(cudf_regex_pat, "XX-XX ", regex=True)

Result:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-50-65e18a60f111> in <module>
      3 cudf_regex_pat = re.compile(r"^.a|dog", flags=re.IGNORECASE)
      4 # TypeError: object of type 're.Pattern' has no len()
----> 5 cudfSeries4.str.replace(cudf_regex_pat, "XX-XX ", regex=True)

/opt/conda/envs/rapids/lib/python3.7/site-packages/cudf/core/column/string.py in replace(self, pat, repl, n, case, flags, regex)
    776         return self._return_or_inplace(
    777             cpp_replace_re(self._column, pat, cudf.Scalar(repl, "str"), n)
--> 778             if regex is True and len(pat) > 1
    779             else cpp_replace(
    780                 self._column,

TypeError: object of type 're.Pattern' has no len()
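
The failure does not depend on cuDF itself: the traceback shows `len(pat)` being called on the compiled object, and `re.Pattern` does not define a length. A minimal sketch that reproduces the same error with only the standard library (the variable names here are illustrative, not taken from the cuDF source):

```python
import re

# Same pattern as in the report above.
pat = re.compile(r"^.a|dog", flags=re.IGNORECASE)

try:
    # This mirrors the `len(pat) > 1` check shown in the traceback.
    len(pat)
except TypeError as err:
    message = str(err)

print(message)
```

Any code path that assumes `pat` is a `str` will hit this before the pattern ever reaches the regex engine.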

pandas supports this: its str.replace method accepts a compiled regular expression object from re.compile() as the pattern. Any flags should be included in the compiled regular expression object:

import re
pandas_regex_pat = re.compile(r"^.a|dog", flags=re.IGNORECASE)
pandasSeries4.str.replace(pandas_regex_pat, "XX-XX ", regex=True)

Output

0 A
1 B
2 C
3 XX-XX ba
4 XX-XX ca
5
6
7 XX-XX BA
8 XX-XX
9 XX-XX t
dtype: string

@Nicholas-7 Nicholas-7 added feature request New feature or request Needs Triage Need team to review and classify labels Apr 8, 2021
@kkraus14 kkraus14 added Python Affects Python cuDF API. strings strings issues (C++ and Python) and removed Needs Triage Need team to review and classify labels Apr 20, 2021
@kkraus14
Collaborator

Hmm, we can extract the pattern and the flags from the compiled regular expression here, but I don't see a clear way to translate that into something libcudf can use.

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@davidwendt
Contributor

It looks like we could extract the pattern from the compiled regex object.

import re
p = re.compile(r"^.a|dog")
print(p.pattern)

^.a|dog

The p.pattern string could be passed to libcudf regex functions.
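
As a rough sketch of that idea, a caller could normalize the `pat` argument before handing it to a string-based regex function. The helper name `split_compiled_pattern` is hypothetical and is not part of cuDF or libcudf; only `re.Pattern.pattern` and `re.Pattern.flags` are real standard-library attributes.

```python
import re


def split_compiled_pattern(pat):
    """Hypothetical helper: return (pattern_string, flags) for a str or re.Pattern."""
    if isinstance(pat, re.Pattern):
        # A compiled pattern carries both its source string and its flags.
        return pat.pattern, pat.flags
    # A plain string has no flags attached.
    return pat, 0


compiled = re.compile(r"^.a|dog", flags=re.IGNORECASE)
pattern, flags = split_compiled_pattern(compiled)

print(pattern)
print(bool(flags & re.IGNORECASE))
```

Note that for a `str` pattern, `flags` also includes `re.UNICODE` by default, so a caller would need to test for the specific flags it cares about rather than compare the whole value.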

@beckernick beckernick added good first issue Good for newcomers and removed inactive-30d labels Oct 29, 2021
@beckernick
Member

For additional information: https://docs.python.org/3/library/re.html#re.Pattern.pattern

@davidwendt davidwendt self-assigned this Nov 1, 2021
@rapids-bot rapids-bot bot closed this as completed in #9573 Nov 2, 2021
rapids-bot bot pushed a commit that referenced this issue Nov 2, 2021
Closes #7904 

Depends on  #9439 

Adds support for accepting a pattern argument in cudf `str.replace` that is built using `re.compile`. The resulting `re.Pattern` has a member `pattern` that is the regex string that can be passed to libcudf. There is also a `flags` member that can be used for libcudf APIs that accept a `regex_flags` parameter along with the `pattern` string.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - https://github.com/brandon-b-miller
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #9573