Replace re with re2 #32303
Maybe it is good to know why? It brings in a non-core dependency that needs to be maintained?
I'd like to see a commit message that explains why and a removal of "import re2 as re".
@@ -17,7 +17,7 @@
"""Providers sub-commands."""
from __future__ import annotations

-import re
+import re2 as re
If there is a functional difference between re and re2, we are bound to end up with issues.
This is for security purposes.
You can find more info here:
https://lists.apache.org/thread/lytmbn1xf9vwgwfwgp4vrm3vshn8p1tm
https://github.com/airflow-s/airflow-s/issues/19
I think the solution is vague in this case, and an (innocuous-looking) functional commit message that shows it addresses the core of the problem is required. What does re2 solve that re doesn't, why can't we do it otherwise, and why?
Additionally re2 is not just a drop-in replacement as you have shown yourself.
Note: I think the actual issue is that we are trusting user input here, and re2 seems on the surface right now to be just a band-aid that does not address the core of the problem. Also, re2 falls back to re if it doesn't know how to handle the regex. But maybe you can elaborate?
Fallback is explicit in pyre2 (we are not using it), but I didn't find any mention of it in the Python google-re2 bindings. I tried a few unsupported regexps and they threw errors straight away. I believe there is no fallback; this was discussed in the issue. Did you find an example or doc regarding fallback for google-re2?
re2 solves the ReDoS problem with a linear-time regexp engine.
I don't have newer arguments than what has already been discussed in the github issue. (backtracking problem, pyre2 vs google-re2, fallback etc.).
As it is not yet released with a patch for this I purposely didn't give too many details here.
If we prefer a different approach, just let me know I can close this in favor of other suggestions.
If needed, can we follow up on the airflow-s issue or the security mailing list?
Note: Commit message has been updated
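To make the backtracking point concrete, here is a minimal sketch (not taken from the PR) that times the standard `re` module on a classic nested-quantifier pattern. The pattern, the input sizes, and the helper name `time_search` are illustrative assumptions. The `re2` part runs only if the google-re2 bindings happen to be installed, and it assumes their module-level `search` accepts the same two arguments as `re.search`.

```python
import re
import time

try:
    import re2  # google-re2 bindings; optional, assumed to be installed separately
except ImportError:
    re2 = None

# Nested quantifiers: a backtracking engine may try exponentially many ways
# to split the run of "a" characters before concluding there is no match.
PATTERN = r"(a+)+$"


def time_search(engine, text):
    """Run one search with the given regex module and return (match, seconds)."""
    start = time.perf_counter()
    match = engine.search(PATTERN, text)
    return match, time.perf_counter() - start


# Input sizes are kept small on purpose so the stdlib run stays short.
for n in (10, 14, 18):
    text = "a" * n + "b"  # the trailing "b" makes the overall match fail
    _, stdlib_seconds = time_search(re, text)
    line = f"n={n:2d}  re: {stdlib_seconds:.4f}s"
    if re2 is not None:
        _, re2_seconds = time_search(re2, text)
        line += f"  re2: {re2_seconds:.4f}s"
    print(line)
```

On a backtracking engine the `re` timings typically grow much faster than the input length, which is the behaviour the linear-time engine is meant to avoid.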
Note: I think the actual issue is that we are trusting user input here, and re2 seems on the surface right now to be just a band-aid that does not address the core of the problem.
Not really. I think it's not a band-aid: using regular expressions is part of our API specification, so we cannot really remove it unless we have a very good reason (and it's actually useful). So closing off a potential way you could (mostly accidentally) trigger a situation where matching takes a lot of time is the right approach - we do not want to remove the functionality there. Moreover, since we will already have the google-re2 dependency (which BTW is proven and battle-tested because it is used internally in the Go language), we can use the opportunity to use it elsewhere where we use regular expressions and protect other places.
Also, re2 falls back to re if it doesn't know how to handle the regex. But maybe you can elaborate?
The fallback is a mechanism of another library; it was a mistake to mention it.
Yep I stand corrected on mentioning the fallback mechanism. I had to find out myself as there was so little detail in the commit message.
The commit message could read:
"Use linear time regular expressions
The standard regexp library can consume > O(n) in certain circumstances. The re2 library does not have this issue.
"
Which clarifies, but doesn't give away the issue.
I do not fully agree with your assessment @potiuk that we want to keep that. We are trusting user input here, and regexp engines are notorious for having issues. IMHO the root cause is trusting user input, and that is what probably should be addressed. I also find that a very good reason for change :-). The new, current commit message says as much now ("untrusted"). This is still a workaround.
I won't stand in the way of the commit, but I stand by my opinion that it is a band-aid.
Do you want me to replace this with
I prefer explicit over implicit. 're2' is not exactly the same as
@bolkedebruin The commit message has been updated with your suggestion. Explicit re2 imports are now made wherever re2 is used.
LGTM
Some tests are failing, @pierrejeambrun
Yep, one mock needed to be adjusted. Should be green now.
LGTM. @bolkedebruin
cc: @bolkedebruin -> I captured your concerns about "band-aid" in #32360 - described the reasoning and consequences, and marked it as an "involves core breaking changes" issue. We recently started marking issues with this label so we can see much more easily what kind of changes we might consider when deciding if/when we release Airflow 3 with breaking changes.
LGTM. Thanks for addressing @pierrejeambrun and thanks for noting the concern @potiuk
The standard regexp library can consume > O(n) in certain circumstances. The re2 library does not have this issue. (cherry picked from commit ee38382)
So a side effect of that is that I’ll see if I can implement our requirements (dags are prefixed with At the very least I think the documentation should mention that the regex implementation is re2, so users have an easier time checking their patterns are correct.
Feel free to make some docs updates on that. PRs to improve our documentation are most welcome. Yes, it is possible that security fixes will introduce breaking behaviour (because security has higher priority than compatibility), and we generally try to make sure that it is reflected in the docs and release notes https://airflow.apache.org/docs/apache-airflow/stable/release_notes.html so if you have the right wording and an idea of how to communicate it better - absolutely, by all means submit a PR. Airflow is created by 2700 contributors - such clarifications and updates are always welcome from new contributors - it shows that contributors want to give back to the community, and it's generally cool when our contributors care about other users. Ping me in the PR you open for that; I am always happy to review and merge those kinds of PRs.
Sure :) let me know if it’s fine or needs touching up. Edit: out of habit I made a markdown link, not rst; fixed ✅
Replace re with re2 in core. Leave re for dev/scripts/test and providers.
(Providers don't have re2 dependencies).
CAMELCASE_TO_SNAKE_CASE_REGEX needs to be rewritten without using lookaround. I was not successful at that, so allowing re here.
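As a rough illustration of the lookaround problem mentioned above, the sketch below converts camelCase to snake_case twice: once with a negative lookahead (which a linear-time engine like re2 does not support) and once without any lookaround, by moving the "not at the start" check into a replacement callback. The definition of `CAMELCASE_TO_SNAKE_CASE_REGEX` is not shown in this thread, so the pattern and all function names here are assumptions for illustration only. Both versions run on the standard `re` module; whether the callback form works unchanged with the google-re2 bindings has not been verified here.

```python
import re

# Assumed shape of a lookaround-based pattern: "(?!^)" is a negative lookahead
# that rejects a match starting at the very beginning of the string.
LOOKAROUND_REGEX = re.compile(r"(?!^)([A-Z]+)")


def to_snake_lookaround(name: str) -> str:
    """Prefix each uppercase run (except one starting at index 0) with "_"."""
    return LOOKAROUND_REGEX.sub(r"_\1", name).lower()


# Lookaround-free alternative: match every uppercase run and let the
# replacement callback decide whether an underscore is needed.
UPPER_RUN = re.compile(r"[A-Z]+")


def to_snake_no_lookaround(name: str) -> str:
    """Same idea, with the start-of-string check done in Python code."""

    def add_separator(match: re.Match) -> str:
        prefix = "_" if match.start() > 0 else ""
        return prefix + match.group(0)

    return UPPER_RUN.sub(add_separator, name).lower()


for example in ("CamelCaseName", "dagRunId"):
    print(example, to_snake_lookaround(example), to_snake_no_lookaround(example))
```

The two helpers agree on simple names like the ones above, but they are not guaranteed to be equivalent: for names that begin with a run of several capitals the results can differ, so a real replacement would need to be checked against the existing tests.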