Support encryption for triggers parameters #36492
Conversation
LGTM - but let's wait for others to review as well.
Nice work.
Can't we access the variables and connections from the trigger run?
What is the recommended pattern for accessing the sensitive information from the trigger run?
We can - when we instantiate the trigger, not when it runs (when it runs, it runs in the async-io event loop and can only access disk, networking, and any other resources via asynchronous calls - which means that accessing them has to follow async/await semantics). And when the Trigger is deserialized by the Triggerer, the deserialization does not run regular … And yes - we could potentially implement some … While it is possible, it adds an additional requirement on …
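The pattern described above can be sketched roughly as follows. This is an illustrative example, not Airflow's actual API - the trigger class and the `lookup_secret` helper are made up: the secret is resolved synchronously at instantiation time on the worker, while `run()` does only async work inside the Triggerer's event loop.

```python
import asyncio


def lookup_secret(conn_id: str) -> str:
    """Stand-in for a synchronous secrets-backend call made on the worker."""
    return f"token-for-{conn_id}"


class SketchTrigger:
    """Illustrative trigger: secrets resolved at instantiation, not in run()."""

    def __init__(self, conn_id: str):
        # Runs on the worker, outside the event loop: blocking I/O is fine here.
        self.token = lookup_secret(conn_id)

    async def run(self):
        # Runs inside the Triggerer's event loop: only async calls are allowed.
        await asyncio.sleep(0)  # placeholder for real async polling
        yield self.token


async def main():
    trigger = SketchTrigger("my_conn")
    async for event in trigger.run():
        return event


result = asyncio.run(main())
print(result)  # token-for-my_conn
```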
Thanks for the detailed explanation. In Airflow we usually access secrets from either disk, connections, or encrypted variables. We can access the same in the trigger while it is running in the event loop - check the S3 trigger below. We usually don't generate new secrets at runtime in the worker and share them with the trigger.
Correct. This is possible, no doubt about it. But it might be quite a bit complex for a number of Triggers. I am not sure how the … Also, in some cases (for example when authentication information comes from externally mounted volumes) it is simply not available. A good example is a managed service where the authentication information is available on a mounted volume, but the volume is not mounted in the Triggerer. Yes, this is a limitation of that particular service, but it also shows a more general problem: the Triggerer would generally need the same auth environment as the worker (which might not be desired or possible).
Yes. You are right. I will verify and update you on the DB access.
Yes. It makes sense.
Testing conversation resolution
🎉
And resolving the conversation enables merging. All cool :)
I have tried accessing variables from the trigger's run function multiple times, after multiple context switches in the event loop, and I can access the variable values without any issues. I am sure we can access connection details similarly, so we can avoid passing complete connection details through the trigger kwargs.
Yes, accessing them will work. But the fact that you can do it does not mean you should :). The problem is that if you do it during the event loop, the access (if not using async/await) will block on I/O or networking, and while it is blocked no other triggers are running, because the event loop runs synchronously. See what happens (and what warnings you get) when you just sleep() for a second or so. Or give your Triggerer fractional CPU (say 0.05) - then you will see that even a short, 50 ms blocking operation will be disastrous for the event loop.
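The effect is easy to demonstrate with the stdlib alone. In this made-up sketch (not Airflow code), one coroutine making a single 300 ms blocking call starves a "heartbeat" coroutine that stands in for the other triggers sharing the same loop:

```python
import asyncio
import time


async def heartbeat(events: list):
    # Simulates other triggers that need the loop to stay responsive:
    # it expects to wake roughly every 50 ms.
    for _ in range(3):
        events.append(time.monotonic())
        await asyncio.sleep(0.05)


async def blocking_trigger():
    time.sleep(0.3)  # BAD: a synchronous call blocks the whole event loop


async def main():
    events = []
    await asyncio.gather(heartbeat(events), blocking_trigger())
    # With the blocking call in the loop, the gap between heartbeats
    # stretches far past the expected 50 ms.
    gaps = [b - a for a, b in zip(events, events[1:])]
    return max(gaps)


worst_gap = asyncio.run(main())
print(worst_gap > 0.2)  # True: one blocking trigger starved the others
```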
I'm running a POC to make this easily possible without blocking the event loop and will create a new AIP soon 🤞 |
👀 |
Yes. You are right. It makes even more sense in the case of request/response APIs. However, in airflow, the trigger might run for seconds/minutes/hours (ex: S3KeyTrigger, KubernetesPodTrigger, etc.). I believe accessing the credentials one time should be ok. |
Not necessarily. It is generally a very bad idea to make a blocking call in an event loop - all kinds of problems follow; even a DNS lookup hanging for a second will heavily impact the Triggerer. It is supposed to handle thousands of triggers running in a single event loop, so even one trigger occasionally hanging for a second might have an enormous impact on the Triggerer's performance.
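For completeness, one stdlib way to do a one-time synchronous lookup without blocking the loop is to offload it to a worker thread. This is a sketch assuming Python 3.9+ (`asyncio.to_thread`); the lookup function is hypothetical, not an Airflow API:

```python
import asyncio
import time


def blocking_variable_lookup(key: str) -> str:
    """Hypothetical stand-in for a synchronous metastore/DB call."""
    time.sleep(0.05)  # simulated blocking I/O
    return f"value-of-{key}"


async def run():
    # Offload the synchronous call to a thread so the event loop keeps
    # serving other triggers while this one waits on the result.
    return await asyncio.to_thread(blocking_variable_lookup, "my_var")


value = asyncio.run(run())
print(value)  # value-of-my_var
```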
I do not insist on the pytest improvements - take them as a "recommendation". Approving in advance ;-D
I just updated the test. Actually, since we test the loaded value:

```python
loaded_trigger: SensitiveKwargsTrigger = TriggerRunner().trigger_row_to_trigger_instance(
    trigger_row, SensitiveKwargsTrigger
)
assert loaded_trigger.param1 == "value1"
assert loaded_trigger.param2 == "value2"
```

we don't care how the value is stored; we just need to check that it is not the original one, as you suggested. 👍
The PR was open for two weeks. Since there are no objections to the proposed solution, I will merge it. |
Hi, just saw this. Just curious, why not just always encrypt, instead of forcing each implementer to mess with the kwargs naming? |
My take: encrypting all of it makes trigger data opaque and difficult to diagnose in case of problems. It also makes some housekeeping impossible (or very slow) in future optimisation cases. A good example of this is the encrypted user sessions in the users DB: we could not delete old sessions without seeing the metadata that was encrypted in the blob - for example, we have not been able to do very efficient "historical session" deletion because the session creation time was part of the blob.

The current implementation uses ExtendedJson to store kwargs in the DB, which in modern DBs (even MySQL(!)) lets us make efficient queries on kwargs. I can easily imagine using this for particular types of triggers ("find all deferred KPOs using this namespace", for example). I think, when we go to multitenancy and maybe (in the future) to multi-multi-tenancy, this kind of queryable metadata might become really useful even for internal Airflow stuff. But even now it can provide valuable information and diagnostics for some power users.
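As an illustration of querying JSON-stored kwargs at the DB level, here is a sketch using SQLite's `json_extract` (available in the SQLite bundled with recent CPython builds). The table and column names are invented for the example, not Airflow's actual trigger schema:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trigger (id INTEGER, kwargs TEXT)")
conn.execute(
    "INSERT INTO trigger VALUES (1, ?)",
    (json.dumps({"namespace": "team-a", "pod_name": "job-1"}),),
)
conn.execute(
    "INSERT INTO trigger VALUES (2, ?)",
    (json.dumps({"namespace": "team-b", "pod_name": "job-2"}),),
)

# e.g. "find all deferred pods using this namespace" - possible only
# because the kwargs are stored as queryable JSON, not an encrypted blob.
rows = conn.execute(
    "SELECT id FROM trigger WHERE json_extract(kwargs, '$.namespace') = ?",
    ("team-a",),
).fetchall()
print(rows)  # [(1,)]
```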
Mhm, I agree that encryption makes troubleshooting harder - but the pickled JSON/object data cannot be queried at the DB level without Python anyway. Considering that the trigger information is very volatile (I am not sure, but I was expecting that after a task is completed it is cleaned up and does not pile up?), it would be really reasonable to just encrypt it all. Then less code is needed.
Fair point (but JSON data can be queried in modern DBs) |
...but I assume we don't want to implement N different dialects for all supported DBs - and once we retrieve the data, we need all of it anyway, not a subset.
* Implement a way to encrypt some trigger params stored in the DB
* Update deferring doc
* Create a variable for encryption prefix
* Update the test

(cherry picked from commit 8fb55f2)
Yeah, I am fine with encrypting it all. It does not seem too problematic, and it has the nice property of securing EVERYTHING.
@hussein-awala - what do you think? |
* Implement a way to encrypt some trigger params stored in the DB
* Update deferring doc
* Create a variable for encryption prefix
* Update the test
This reverts commit 8fb55f2.
…pache#38253) This reverts commit 8fb55f2.
This PR implements a way to encrypt some of the trigger parameters using Fernet before storing them in the database.
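A minimal sketch of the idea, assuming the `cryptography` package is available. The prefix constant and helper names below are illustrative, not the PR's actual identifiers: kwargs whose names carry a marker prefix are Fernet-encrypted before storage and decrypted when the trigger row is deserialized.

```python
from cryptography.fernet import Fernet

ENCRYPTED_PREFIX = "encrypted__"  # illustrative marker, not Airflow's constant
fernet = Fernet(Fernet.generate_key())


def serialize_kwargs(kwargs: dict) -> dict:
    """Encrypt values whose key carries the marker prefix before DB storage."""
    return {
        k: fernet.encrypt(v.encode()).decode() if k.startswith(ENCRYPTED_PREFIX) else v
        for k, v in kwargs.items()
    }


def deserialize_kwargs(stored: dict) -> dict:
    """Decrypt marked values when rebuilding the trigger from its DB row."""
    return {
        k: fernet.decrypt(v.encode()).decode() if k.startswith(ENCRYPTED_PREFIX) else v
        for k, v in stored.items()
    }


stored = serialize_kwargs({"encrypted__token": "s3cr3t", "region": "us-east-1"})
assert stored["encrypted__token"] != "s3cr3t"  # not stored in clear text
assert stored["region"] == "us-east-1"         # unmarked kwargs stay queryable
restored = deserialize_kwargs(stored)
print(restored["encrypted__token"])  # s3cr3t
```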