
Support encryption for triggers parameters #36492

Merged Jan 9, 2024 (6 commits)

Conversation

hussein-awala
Member

This PR implements a way to encrypt some of the trigger parameters using Fernet before storing them in the database.
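The approach can be sketched roughly like this (a minimal illustration, not the PR's actual code: the helper names, the `__fernet__` marker, and the sensitive-field set are made up for the example; only the use of `cryptography.fernet` matches the PR):

```python
# Sketch: Fernet-encrypt selected trigger kwargs before DB storage.
# Assumes the `cryptography` package is installed.
import json

from cryptography.fernet import Fernet

key = Fernet.generate_key()  # in Airflow this would be the configured fernet key
fernet = Fernet(key)


def encrypt_sensitive(kwargs, sensitive_fields):
    """Replace sensitive values with Fernet ciphertext before storing."""
    stored = {}
    for name, value in kwargs.items():
        if name in sensitive_fields:
            token = fernet.encrypt(json.dumps(value).encode()).decode()
            stored[name] = {"__fernet__": token}  # marker so we know to decrypt
        else:
            stored[name] = value
    return stored


def decrypt_sensitive(stored):
    """Reverse of encrypt_sensitive: restore the original kwargs."""
    decoded = {}
    for name, value in stored.items():
        if isinstance(value, dict) and "__fernet__" in value:
            decoded[name] = json.loads(fernet.decrypt(value["__fernet__"].encode()))
        else:
            decoded[name] = value
    return decoded


stored = encrypt_sensitive({"token": "s3cr3t", "path": "/tmp"}, {"token"})
assert stored["token"] != "s3cr3t"                      # ciphertext lands in the DB
assert decrypt_sensitive(stored)["token"] == "s3cr3t"   # round-trip restores it
```

The marker-dict pattern lets plain and encrypted fields coexist in the same JSON column.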

potiuk
potiuk previously approved these changes Dec 30, 2023
Member

@potiuk potiuk left a comment


LGTM - but let's wait for others to review as well.

Collaborator

@dirrao dirrao left a comment


Nice work.
Can't we access the variables and connections from the trigger run?
What is the recommended pattern for accessing the sensitive information from the trigger run?

@potiuk
Member

potiuk commented Dec 30, 2023

> Nice work. Can't we access the variables and connections from the trigger run? What is the recommended pattern for accessing the sensitive information from the trigger run?

We can - when we instantiate the trigger, not when it runs (when it runs, it runs in an async-io event loop and can only access disk, networking and other resources via asynchronous calls - which means that accessing them has to follow async/await semantics). And when a Trigger is deserialized by the Triggerer, the deserialization does not run the regular `__init__` call on the Trigger class - it just deserializes it from the DB.

And yes - we could potentially implement some `_post_deserialize` call to retrieve sensitive information from secrets or variables - however, in a number of cases this means that the Triggerer would have to access some authentication information to retrieve it. For example, it would have to have the same "account" credentials the worker has, or it might have to have access to (say) the same kubernetes config or .aws or .gcs authentication in order to retrieve that data.

While it is possible, it adds an additional requirement on the triggerer process - it would have to have the same authentication information set in its environment as the worker. But this is not necessary in most cases - as long as "serialization/deserialization" has all the necessary information.
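The constraint above can be sketched with a toy round-trip (illustrative names only - real Airflow triggers implement `BaseTrigger.serialize()`, and the rebuild step below is a simplification of what the Triggerer does):

```python
# Toy model of the trigger round-trip through the DB: a trigger serializes
# to (classpath, kwargs), and the Triggerer rebuilds it from exactly those
# stored kwargs -- so any secret the trigger needs at run time must already
# be inside `kwargs` (hence the motivation to encrypt some of them at rest).

class DateTimeTrigger:
    def __init__(self, moment):
        self.moment = moment

    def serialize(self):
        # (classpath, kwargs) is what gets written to the trigger table.
        return ("example.DateTimeTrigger", {"moment": self.moment})


def rebuild(classpath, kwargs, registry):
    # The Triggerer only has the stored kwargs to work with; it cannot
    # re-run any secret-fetching logic the worker did before deferring.
    cls = registry[classpath]
    return cls(**kwargs)


registry = {"example.DateTimeTrigger": DateTimeTrigger}
path, kwargs = DateTimeTrigger("2024-01-09T00:00:00").serialize()
restored = rebuild(path, kwargs, registry)
assert restored.moment == "2024-01-09T00:00:00"
```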

@dirrao
Collaborator

dirrao commented Dec 30, 2023

> Nice work. Can't we access the variables and connections from the trigger run? What is the recommended pattern for accessing the sensitive information from the trigger run?
>
> We can - when we instantiate the trigger, not when it runs (when it runs, it runs in an async-io event loop and can only access disk, networking and other resources via asynchronous calls - which means that accessing them has to follow async/await semantics). And when a Trigger is deserialized by the Triggerer, the deserialization does not run the regular `__init__` call on the Trigger class - it just deserializes it from the DB.
>
> And yes - we could potentially implement some `_post_deserialize` call to retrieve sensitive information from secrets or variables - however, in a number of cases this means that the Triggerer would have to access some authentication information to retrieve it. For example, it would have to have the same "account" credentials the worker has, or it might have to have access to (say) the same kubernetes config or .aws or .gcs authentication in order to retrieve that data.
>
> While it is possible, it adds an additional requirement on the triggerer process - it would have to have the same authentication information set in its environment as the worker. But this is not necessary in most cases - as long as "serialization/deserialization" has all the necessary information.

Thanks for the detailed explanation. We usually access secrets from disk, connections, or encrypted variables in Airflow. We can access the same in the trigger while it is running in the event loop - see the S3 trigger below. We usually don't generate new secrets at runtime in the worker and share them with the trigger:

    async with self.hook.async_conn as client:

@potiuk
Member

potiuk commented Dec 30, 2023

> We usually access secrets from either disk or connections or encrypted variables in Airflow

Correct. This is possible, no doubt about it.

But it might be quite a bit more complex for a number of Triggers. I am not sure how the async hook works, but if you would like to use Airflow's Connection (either from the DB or from Secrets), I think it would not work (it would need async-aware access to the DB or Secrets). I would love to see how it can work in this case - could you please explain it? Maybe I am missing something. I believe it uses the built-in async aiobotocore credential/session retrieval, but that would not use Airflow's Connections/Secrets - or am I wrong?

Also, in some cases (for example when authentication information comes from externally mounted volumes) it is simply not available. A good example is a managed service where the authentication information is available on a mounted volume, but the volume is not mounted in the Triggerer. Yes, this is a limitation of that particular service, but it also shows a more general problem: the Triggerer would generally need the same auth environment as the worker (which might not be desired or possible).

@dirrao
Collaborator

dirrao commented Dec 30, 2023

> We usually access secrets from either disk or connections or encrypted variables in Airflow
>
> Correct. This is possible, no doubt about it.
>
> But it might be quite a bit more complex for a number of Triggers. I am not sure how the async hook works, but if you would like to use Airflow's Connection (either from the DB or from Secrets), I think it would not work (it would need async-aware access to the DB or Secrets). I would love to see how it can work in this case - could you please explain it? Maybe I am missing something. I believe it uses the built-in async aiobotocore credential/session retrieval, but that would not use Airflow's Connections/Secrets - or am I wrong?

Yes. You are right. I will verify and update you on the DB access.

> Also, in some cases (for example when authentication information comes from externally mounted volumes) it is simply not available. A good example is a managed service where the authentication information is available on a mounted volume, but the volume is not mounted in the Triggerer. Yes, this is a limitation of that particular service, but it also shows a more general problem: the Triggerer would generally need the same auth environment as the worker (which might not be desired or possible).

Yes, it makes sense.

@hussein-awala hussein-awala dismissed potiuk’s stale review December 30, 2023 12:59

Testing conversation resolution

@hussein-awala
Member Author

Merging is blocked
The base branch requires all conversations on code to be resolved.

🎉

@potiuk
Member

potiuk commented Dec 30, 2023

And resolving the conversation enables merging. All cool :)

@dirrao
Collaborator

dirrao commented Dec 30, 2023

> We usually access secrets from either disk or connections or encrypted variables in Airflow
>
> Correct. This is possible, no doubt about it.
>
> But it might be quite a bit more complex for a number of Triggers. I am not sure how the async hook works, but if you would like to use Airflow's Connection (either from the DB or from Secrets), I think it would not work (it would need async-aware access to the DB or Secrets). I would love to see how it can work in this case - could you please explain it? Maybe I am missing something. I believe it uses the built-in async aiobotocore credential/session retrieval, but that would not use Airflow's Connections/Secrets - or am I wrong?
>
> Yes. You are right. I will verify and update you on the DB access.

I have tried accessing the variables from the trigger run function multiple times, after multiple context switches in the event loop. I can access the variable values without any issues. I am sure we can access the connection details similarly, so we can avoid passing complete connection details through trigger kwargs.

> Also, in some cases (for example when authentication information comes from externally mounted volumes) it is simply not available. A good example is a managed service where the authentication information is available on a mounted volume, but the volume is not mounted in the Triggerer. Yes, this is a limitation of that particular service, but it also shows a more general problem: the Triggerer would generally need the same auth environment as the worker (which might not be desired or possible).

Yes, it makes sense.

@potiuk
Member

potiuk commented Dec 30, 2023

> I have tried accessing the variables from the trigger run function multiple times after multiple context switches in the event loop. I can access the variable values without any issues

Yes. Accessing will work. But the fact that you can do it - does not mean you should :).

The problem is that if you do it during the event loop, the access (if not using async/await) will block on I/O or networking, and while it is blocked, no other triggers are running, because the event loop runs synchronously. See what happens (and what warnings you get) when you just `sleep()` for a second or so. Or give your Triggerer fractional CPU (say 0.05) - then you will see that even a short, 50 ms blocking operation is disastrous for the event loop.
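The effect is easy to reproduce outside Airflow. In this toy loop (not Airflow code), a coroutine that calls `time.sleep()` freezes the whole loop, while one that uses `await asyncio.sleep()` yields control:

```python
# Demonstrates why a blocking call inside a trigger's async run()
# starves every other trigger sharing the event loop.
import asyncio
import time


async def well_behaved(results):
    # Yields to the loop, so other coroutines keep running meanwhile.
    await asyncio.sleep(0.05)
    results.append("async")


async def blocking(results):
    # time.sleep() never yields: the whole loop is frozen for 0.2 s.
    # In a Triggerer this would stall thousands of other triggers.
    time.sleep(0.2)
    results.append("blocking")


async def main():
    results = []
    await asyncio.gather(blocking(results), well_behaved(results))
    return results


order = asyncio.run(main())
# The blocking coroutine finishes first even though it "sleeps" longer,
# because it monopolized the loop before well_behaved could resume.
print(order)  # ['blocking', 'async']
```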

@hussein-awala
Member Author

> The problem is that if you do it during the event loop, the access (if not using async/await) will block on I/O or networking, and while it is blocked, no other triggers are running, because the event loop runs synchronously. See what happens (and what warnings you get) when you just `sleep()` for a second or so. Or give your Triggerer fractional CPU (say 0.05) - then you will see that even a short, 50 ms blocking operation is disastrous for the event loop.

I'm running a POC to make this easily possible without blocking the event loop and will create a new AIP soon 🤞

@potiuk
Member

potiuk commented Dec 30, 2023

> I'm running a POC to make this easily possible without blocking the event loop and will create a new AIP soon 🤞

👀

@dirrao
Collaborator

dirrao commented Dec 30, 2023

> I have tried accessing the variables from the trigger run function multiple times after multiple context switches in the event loop. I can access the variable values without any issues
>
> Yes. Accessing will work. But the fact that you can do it does not mean you should :).
>
> The problem is that if you do it during the event loop, the access (if not using async/await) will block on I/O or networking, and while it is blocked, no other triggers are running, because the event loop runs synchronously. See what happens (and what warnings you get) when you just `sleep()` for a second or so. Or give your Triggerer fractional CPU (say 0.05) - then you will see that even a short, 50 ms blocking operation is disastrous for the event loop.

Yes, you are right. It makes even more sense in the case of request/response APIs. However, in Airflow, a trigger might run for seconds, minutes, or hours (e.g. S3KeyTrigger, KubernetesPodTrigger). I believe accessing the credentials one time should be OK.

@potiuk
Member

potiuk commented Dec 30, 2023

> Yes, you are right. It makes even more sense in the case of request/response APIs. However, in Airflow, a trigger might run for seconds, minutes, or hours (e.g. S3KeyTrigger, KubernetesPodTrigger). I believe accessing the credentials one time should be OK.

Not necessarily. It is generally a very bad idea to make a blocking call in the event loop. All kinds of problems arise - even a DNS lookup hanging for a second will heavily impact the Triggerer's capacity. It is supposed to handle thousands of triggers running in a single event loop, so even if just one trigger occasionally hangs for a second, it can have an enormous impact on the Triggerer's performance.
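One conventional way to do a one-time blocking lookup without freezing the loop is to push it to a thread pool via `run_in_executor`. A sketch, assuming a hypothetical synchronous `fetch_credentials()` helper (not an Airflow API):

```python
# Sketch: offload a blocking, one-time credentials fetch to a worker
# thread so the Triggerer's event loop keeps making progress.
import asyncio
import time


def fetch_credentials():
    # Stand-in for a synchronous DB/secrets-backend call (hypothetical).
    time.sleep(0.1)
    return {"token": "s3cr3t"}


async def trigger_run():
    loop = asyncio.get_running_loop()
    # run_in_executor runs the blocking call in the default thread pool;
    # the await yields, so other triggers in the loop are not stalled.
    creds = await loop.run_in_executor(None, fetch_credentials)
    return creds["token"]


print(asyncio.run(trigger_run()))  # s3cr3t
```

This keeps the loop responsive, though it still requires the Triggerer to have the auth environment needed by the lookup, which is the concern raised above.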

Contributor

@jscheffl jscheffl left a comment


I do not insist too hard on the pytest improvements - take them as a recommendation. Approving in advance ;-D

Review comment on tests/models/test_trigger.py (resolved)
@hussein-awala
Member Author

hussein-awala commented Jan 9, 2024

> I do not insist too hard on the pytest improvements - take them as a recommendation. Approving in advance ;-D

I just updated the test.

Actually, since we test the loaded value:

    loaded_trigger: SensitiveKwargsTrigger = TriggerRunner().trigger_row_to_trigger_instance(
        trigger_row, SensitiveKwargsTrigger
    )
    assert loaded_trigger.param1 == "value1"
    assert loaded_trigger.param2 == "value2"

we don't care how the value is stored, we just need to check that it is not the original one, as you suggested. 👍

@hussein-awala
Member Author

The PR was open for two weeks. Since there are no objections to the proposed solution, I will merge it.

@hussein-awala hussein-awala merged commit 8fb55f2 into apache:main Jan 9, 2024
53 checks passed
@dstandish
Contributor

Hi, just saw this. Just curious, why not just always encrypt, instead of forcing each implementer to mess with the kwargs naming?

@potiuk
Member

potiuk commented Jan 10, 2024

> Hi, just saw this. Just curious, why not just always encrypt, instead of forcing each implementer to mess with the kwargs naming?

My take: encrypting all of it makes trigger data opaque and difficult to diagnose in case of problems.

Also, in future optimisation cases it makes it impossible (or very slow) to do some housekeeping. A good example of this is encrypted user sessions in the users DB: we could not delete old sessions without seeing the metadata that was encrypted in the blob - for example, we have not been able to do very efficient "historical session" deletion because the session creation time was part of the blob.

The current implementation uses ExtendedJson to store kwargs in the DB - and in modern DBs (even MySQL!) we could make efficient queries against the kwargs. I can easily imagine using this for particular types of triggers ("find all deferred KPOs that have been using this namespace", for example). I think - for example - when we go to multitenancy, and maybe in the future to multi-multi-tenancy, this kind of queryable metadata might become really useful even for internal Airflow stuff. But even now it can provide valuable information and diagnostics for some power users.

@jscheffl
Contributor

> Hi, just saw this. Just curious, why not just always encrypt, instead of forcing each implementer to mess with the kwargs naming?
>
> My take: encrypting all of it makes trigger data opaque and difficult to diagnose in case of problems.
>
> Also, in future optimisation cases it makes it impossible (or very slow) to do some housekeeping. A good example of this is encrypted user sessions in the users DB: we could not delete old sessions without seeing the metadata that was encrypted in the blob - for example, we have not been able to do very efficient "historical session" deletion because the session creation time was part of the blob.
>
> The current implementation uses ExtendedJson to store kwargs in the DB - and in modern DBs (even MySQL!) we could make efficient queries against the kwargs. I can easily imagine using this for particular types of triggers ("find all deferred KPOs that have been using this namespace", for example). I think - for example - when we go to multitenancy, and maybe in the future to multi-multi-tenancy, this kind of queryable metadata might become really useful even for internal Airflow stuff. But even now it can provide valuable information and diagnostics for some power users.

Mhm, I agree that encryption makes troubleshooting harder - but the pickled JSON/object data cannot be queried at the DB level without Python anyway.

Considering that the trigger information is very volatile (I am not sure, but I was expecting that after a task completes it is cleaned up and does not pile up?), it would be really reasonable to just encrypt everything. Then less code is needed.

@potiuk
Member

potiuk commented Jan 10, 2024

> Considering that the trigger information is very volatile (I am not sure, but I was expecting that after a task completes it is cleaned up and does not pile up?), it would be really reasonable to just encrypt everything. Then less code is needed.

Fair point (but JSON data can be queried in modern DBs)
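For illustration (not Airflow's actual schema - the table and column names are made up), this is the kind of query that per-field encryption keeps possible and whole-blob encryption rules out. SQLite's `json_extract` stands in for the equivalent MySQL/PostgreSQL JSON functions:

```python
# Querying inside a JSON column: find triggers by a plaintext kwarg.
# If the whole kwargs blob were encrypted, this query would be impossible.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trigger (id INTEGER PRIMARY KEY, kwargs TEXT)")
rows = [
    (1, json.dumps({"namespace": "team-a", "pod_name": "job-1"})),
    (2, json.dumps({"namespace": "team-b", "pod_name": "job-2"})),
]
conn.executemany("INSERT INTO trigger VALUES (?, ?)", rows)

# "Find all deferred pod triggers using this namespace", done in SQL.
found = conn.execute(
    "SELECT id FROM trigger WHERE json_extract(kwargs, '$.namespace') = ?",
    ("team-a",),
).fetchall()
print(found)  # [(1,)]
```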

@potiuk
Member

potiuk commented Jan 10, 2024

@ephraimbuddy ephraimbuddy added the type:bug-fix Changelog: Bug Fixes label Jan 10, 2024
@jscheffl
Contributor

https://www.postgresqltutorial.com/postgresql-tutorial/postgresql-json/

...but I assume we don't want to implement N different dialects for all supported DBs - and anyway, once we retrieve the data, we need all of it, not a subset.

ephraimbuddy pushed a commit that referenced this pull request Jan 11, 2024
* Implement a way to encrypt some trigger params stored in the DB

* Update deferring doc

* Create a variable for encryption prefix

* Update the test

(cherry picked from commit 8fb55f2)
@potiuk
Member

potiuk commented Jan 14, 2024

Yeah, I am fine with encrypting it all. It does not seem too problematic, and it has the nice property of securing EVERYTHING.

@potiuk
Member

potiuk commented Jan 14, 2024

@hussein-awala - what do you think?

abhishekbhakat pushed a commit to abhishekbhakat/my_airflow that referenced this pull request Mar 5, 2024
* Implement a way to encrypt some trigger params stored in the DB

* Update deferring doc

* Create a variable for encryption prefix

* Update the test
hussein-awala added a commit to hussein-awala/airflow that referenced this pull request Mar 17, 2024
ephraimbuddy added a commit to astronomer/airflow that referenced this pull request Mar 18, 2024
ephraimbuddy added a commit that referenced this pull request Mar 18, 2024
@ephraimbuddy ephraimbuddy removed the type:bug-fix Changelog: Bug Fixes label Mar 18, 2024
pateash pushed a commit to pateash/airflow that referenced this pull request Mar 18, 2024
@ephraimbuddy ephraimbuddy added the changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..) label Mar 25, 2024
utkarsharma2 pushed a commit to astronomer/airflow that referenced this pull request Apr 22, 2024
Labels
area:Scheduler including HA (high availability) scheduler area:Triggerer changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..) kind:documentation
7 participants