Speed up regular expression substitution #91524

Closed
serhiy-storchaka opened this issue Apr 14, 2022 · 0 comments · Fixed by #91525
Labels: 3.12 (bugs and security fixes), performance (Performance or resource usage), topic-regex, type-feature (A feature request or enhancement)

re.sub() is relatively slow because it calls Python code for every match.

Implementing it in C makes re.sub() 2-3 times faster.
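
To make the per-match cost concrete, here is a minimal sketch of what expanding a replacement template involves. It is an illustration only, not the actual CPython implementation: the helper `sub_with_python_loop` and the template `[\1]` are hypothetical, and the loop ignores edge cases such as empty matches.

```python
import re

pattern = re.compile("(a)")
s = "aaa"

# Replacement template containing a group reference.
via_template = pattern.sub(r"[\1]", s)

# The same substitution with a callable: Match.expand() applies the
# template to one match at a time, so Python code runs for every match.
via_callable = pattern.sub(lambda m: m.expand(r"[\1]"), s)


def sub_with_python_loop(pattern, string):
    """Hypothetical helper: build the result match by match in Python."""
    out = []
    pos = 0
    for m in pattern.finditer(string):
        out.append(string[pos:m.start()])   # text between matches
        out.append("[" + m.group(1) + "]")  # expanded template
        pos = m.end()
    out.append(string[pos:])                # trailing text
    return "".join(out)


via_loop = sub_with_python_loop(pattern, s)
print(via_template, via_callable, via_loop)  # [a][a][a] [a][a][a] [a][a][a]
```

All three forms produce the same string; they differ only in how much work is done in Python per match.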

For each command below, the first timing is without the change and the second is with it.

$ ./python -m timeit -s 'import re; s = "a"' 're.sub("(a)", r"\1", s)'
100000 loops, best of 5: 2.45 usec per loop
500000 loops, best of 5: 860 nsec per loop
$ ./python -m timeit -s 'import re; s = "a"; p = re.compile("(a)")' 'p.sub(r"\1", s)'
200000 loops, best of 5: 1.79 usec per loop
500000 loops, best of 5: 546 nsec per loop
$ ./python -m timeit -s 'import re; s = "a"*10**3' 're.sub("(a)", r"\1", s)'
500 loops, best of 5: 620 usec per loop
1000 loops, best of 5: 252 usec per loop
$ ./python -m timeit -s 'import re; s = "a"' 're.sub("(a)", r"b", s)'
500000 loops, best of 5: 711 nsec per loop
500000 loops, best of 5: 663 nsec per loop
$ ./python -m timeit -s 'import re; s = "a"' 're.sub("(a)", r"\n", s)'
200000 loops, best of 5: 1.7 usec per loop
500000 loops, best of 5: 864 nsec per loop
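
The same measurements can be taken from a script using the standard `timeit` module. This is a rough sketch: the helper name `bench`, the `CASES` list, and the loop counts are assumptions chosen for illustration, and absolute numbers will vary by machine and build.

```python
import timeit


def bench(stmt, setup, number=200, repeat=5):
    """Hypothetical helper: best per-loop time in seconds over `repeat` runs."""
    timer = timeit.Timer(stmt, setup=setup)
    return min(timer.repeat(repeat=repeat, number=number)) / number


# (statement, setup) pairs mirroring the shell commands above.
CASES = [
    (r're.sub("(a)", r"\1", s)', 'import re; s = "a"'),
    (r'p.sub(r"\1", s)', 'import re; s = "a"; p = re.compile("(a)")'),
    (r're.sub("(a)", r"\1", s)', 'import re; s = "a"*10**3'),
    (r're.sub("(a)", r"b", s)', 'import re; s = "a"'),
    (r're.sub("(a)", r"\n", s)', 'import re; s = "a"'),
]

if __name__ == "__main__":
    for stmt, setup in CASES:
        print(f"{bench(stmt, setup) * 1e6:10.3f} usec per loop  {stmt}")
```

Running the script under two interpreter builds (with and without the change) gives a before/after comparison similar to the pairs of timings listed above.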

Initially I also implemented a public API for explicit compilation of the replacement string, but then left it to a separate issue.

serhiy-storchaka added the type-feature (A feature request or enhancement), topic-regex, and 3.11 (only security fixes) labels Apr 14, 2022
serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue Apr 14, 2022
Functions re.sub() and re.subn() and corresponding re.Pattern methods
are now 2-3 times faster for replacement strings containing group references.
AlexWaygood added the performance (Performance or resource usage) label Apr 14, 2022
iritkatriel added the 3.12 (bugs and security fixes) label and removed the 3.11 (only security fixes) label Sep 7, 2022
gpshead pushed a commit that referenced this issue Oct 23, 2022
Functions re.sub() and re.subn() and corresponding re.Pattern methods
are now 2-3 times faster for replacement strings containing group references.

Closes #91524

Primarily authored by Serhiy Storchaka (serhiy-storchaka)
Minor-cleanups-by: Gregory P. Smith [Google] <[email protected]>