Find Pod Before Cleanup In KubernetesPodOperator Execution #22092
Conversation
Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
I am planning to release ...
@potiuk @jedcunningham Is it possible to get approval to run all the CI workflows?
The PR is likely OK to be merged with just a subset of tests for default Python and Database versions without running the full matrix of tests, because it does not modify the core of Airflow. If the committers decide that the full tests matrix is needed, they will add the label 'full tests needed'. Then you should rebase to the latest main or amend the last commit of the PR, and push it with --force-with-lease.
@jedcunningham Do I need to update the helm chart tests?
Yeah, it looks like those will need some attention. Hopefully you can reproduce following these instructions: ...
@jedcunningham Is it possible to get the CI to run again? I updated the helm chart tests.
Hey @michaelmicheal, thanks for this PR. I think I understand the issue now, but the solution feels a little indirect. The reason we want to skip deletion is that the operator tried to create a pod and one with that name already exists. But your "skip deletion" logic is "can't find pod", and there is a pod there... it just seems like we can tighten it up a little bit. The other issue is that you make a backward-incompatible signature change.

Here's what I would propose. When we attempt to create the pod and it already exists, we get an ApiException whose body looks like this:

```python
{'kind': 'Status', 'apiVersion': 'v1', 'metadata': {}, 'status': 'Failure',
 'message': 'pods "test-kubernetes-pod" already exists', 'reason': 'AlreadyExists',
 'details': {'name': 'test-kubernetes-pod', 'kind': 'pods'}, 'code': 409}
```

So what we could do is, in our exception handling, check for that specific error and skip the deletion only in that case. What do you think?
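A minimal sketch of what such a check could look like; the helper name `_pod_already_exists` is illustrative, not anything in the operator, and it assumes the `kubernetes` Python client's `ApiException`, whose `body` attribute carries the JSON Status payload quoted above:

```python
import json

from kubernetes.client.rest import ApiException


def _pod_already_exists(exc: BaseException) -> bool:
    # Illustrative helper: True only for the 409 raised when a pod with the
    # same name already exists (the Status payload quoted above).
    if not isinstance(exc, ApiException) or exc.status != 409:
        return False
    try:
        body = json.loads(exc.body or "{}")
    except ValueError:
        return False
    return body.get("reason") == "AlreadyExists"
```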
@dstandish Any suggestions on how I should pass the exception, or tell cleanup to skip the deletion?
So to do this sort of thing, I think you have to create a variable outside the scope of the try:

```python
exc = None
try:
    ...
except Exception as e:
    exc = e
finally:
    self.cleanup(
        pod=self.pod or self.pod_request_obj,
        remote_pod=remote_pod,
        exc=exc,
    )
```

Then you'd want to have some logic in cleanup to evaluate the exc and skip delete in that scenario. I would not mess with stepping back, I realize the difference between ...
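A rough sketch of that cleanup-side logic, assuming `cleanup` grows an optional `exc` parameter as above and reusing the hypothetical `_pod_already_exists` helper from the earlier sketch (this is an illustration of the idea, not the code the PR ended up with):

```python
def cleanup(self, pod, remote_pod, exc=None):
    # ... existing event logging stays as is ...
    if exc is not None and _pod_already_exists(exc):
        # Creation failed because a pod with this name already exists; that
        # pod belongs to another task, so do not delete it.
        self.log.warning("Pod %s already existed, skipping deletion", pod.metadata.name)
        return
    with _suppress(Exception):
        self.process_pod_deletion(pod)
```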
Coincidentally, I just encountered a different issue where we get a 409 error. In that case, we were trying to patch the pod based on an outdated pod object and got this error response:
So if we go the error parsing route, we just have to make sure we're being targeted enough (i.e. just looking for code 409 is not sufficient; we must also verify it's a "pod already exists" scenario).
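Concretely, both failures surface as HTTP 409, so a targeted check has to look at the Status body rather than the status code alone. A small illustrative predicate, with the `details.kind` check taken from the payload quoted earlier:

```python
def _should_skip_deletion(status_body: dict) -> bool:
    # Illustrative: 409 alone also matches the stale-patch conflict case, so
    # require the AlreadyExists reason and confirm it is about a pod.
    return (
        status_body.get("code") == 409
        and status_body.get("reason") == "AlreadyExists"
        and status_body.get("details", {}).get("kind") == "pods"
    )
```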
I think the argument for finding the pod before cleanup is that it ensures a pod exists before attempting to delete it. This works not only for a specific edge case (like the situation where it tried to create a pod but one with that name already exists), but for any situation in which the pod doesn't exist. I'm happy to implement your proposed solution @dstandish, but what do you think?
I'm OK with it. Just try to document the intention with a comment and a test.
OK actually... I think there's a simpler way to fix this. When we are calling ..., then it will only delete a pod that it has found already. That will solve your issue. wdyt? This is similar to #23676. Maybe we also add a ...
Makes sense to me. If we're calling ...
Because to do that we have to change the signature of ...
Oh, you also mention the option of putting it in finally. I guess putting it in ... I think maybe ideally ...
Fair enough, makes sense to me. I'll move the ...
Yeah, that sounds good to me.
@michaelmicheal there are conflicts :(
@dstandish @eladkal I resolved the conflicts. Could I get the CI workflow to run?
```
@@ -428,16 +430,18 @@ def cleanup(self, pod: k8s.V1Pod, remote_pod: k8s.V1Pod):
        with _suppress(Exception):
            for event in self.pod_manager.read_pod_events(pod).items:
                self.log.error("Pod Event: %s - %s", event.reason, event.message)
        with _suppress(Exception):
            self.process_pod_deletion(pod)
        if remote_pod is not None:
```
I'm not sure if we care, but if the create succeeds but the find fails, we can leave the pod with this approach.
Is it fair to assume that if the pod find fails then the pod doesn't exist and we don't need to delete it?
I've kicked CI off for you.
Looks green @dstandish @jedcunningham :)
@potiuk @dstandish @jedcunningham Do I need to make any other changes or is this PR good to merge?
Small changes. Sorry, I had a half-completed review that was just sitting there.
@dstandish I added the `pod is None` check to ...
@jedcunningham any suggestions for changes?
As outlined in this issue, running multiple KubernetesPodOperators with `random_name_suffix=False` and `is_delete_operator_pod=True` leads to a collision: every task tries to create a pod with the same name (`'my_pod'`, for example), the later task fails because `'my_pod'` already exists, and its cleanup then deletes `'my_pod'`, which is the pod from the first task.

Ideally the second task shouldn't delete the pod from the first task, so I added a check to make sure a task's pod exists with the `find_pod` method before calling the `cleanup` function (which handles the deletion of the pod).

Validation
To reproduce the issue and validate this change I ran two dag runs of the following DAG at the same time.
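The DAG itself isn't shown here; as a rough sketch, a reproduction could look something like the following, where the dag_id, image, and command are made up for illustration and the essential pieces are the fixed pod name (`random_name_suffix=False`) and deletion on completion (`is_delete_operator_pod=True`):

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

# Illustrative sketch: with a fixed pod name, two simultaneous dag runs race
# on the same pod. The second run fails with "already exists" and, before
# this fix, its cleanup deleted the pod belonging to the first run.
with DAG(
    dag_id="kpo_fixed_name_repro",  # made-up dag_id
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
):
    KubernetesPodOperator(
        task_id="fixed_name_pod",
        name="my_pod",                # fixed name shared by both dag runs
        namespace="default",          # assumed namespace
        image="busybox",
        cmds=["sh", "-c", "sleep 60"],
        random_name_suffix=False,     # do not append a random suffix
        is_delete_operator_pod=True,  # delete the pod when the task finishes
    )
```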