Handle kubernetes watcher stream disconnection #15500
Conversation
Force-pushed ec75607 to 98e9d6f
The Workflow run is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*.
Force-pushed ccdad89 to 8e37bd6
A few minor things, but otherwise LGTM.
Force-pushed 6710e52 to da5cc72
The Workflow run is cancelling this PR. Building images for the PR has failed. Follow the workflow link to check the reason.
So this change introduces a problem.
If we are asking to watch at resource version A (which is too old), and we do a list and get back version A+5 (say), then we have "lost" the events A+1..A+4, i.e. we would never call process_event() for the events in this range.
So we need to think about whether that is a problem or not.
Unfortunately, right now we have that same problem, as we (attempt to) rewatch from "0", a.k.a. any point in history. Per the docs, the latest is preferred, so the behavior is likely pretty similar. I think the piece we are missing is a step to reconcile our state against what the list returns, e.g. detect that we missed A+1..A+4 and action them. More work needs to be done to harden this area and this is the first step (and this alone doesn't buy us much other than setting us up for being more complete later). @ephraimbuddy and I will spend some time trying to reproduce this scenario on demand to make sure we are starting down the right path.
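For illustration, here is a rough Python sketch (not the code in this PR) of the reconciliation step described above: after a relist, compare the fresh list against the pods we were tracking locally and synthesize events for anything we missed between the stale resource version and the list. `tracked_pods`, `process_event`, and the event shapes are hypothetical names used only for this sketch.

```python
def reconcile_after_relist(v1, namespace, tracked_pods, process_event):
    """Reconcile locally tracked pod phases against a fresh list result."""
    pod_list = v1.list_namespaced_pod(namespace=namespace)
    listed = {p.metadata.name: p for p in pod_list.items}

    # Pods whose phase changed while the watch was disconnected: synthesize a
    # MODIFIED event so downstream state catches up.
    for name, pod in listed.items():
        known_phase = tracked_pods.get(name)
        if known_phase is not None and known_phase != pod.status.phase:
            process_event({"type": "MODIFIED", "object": pod})

    # Pods we were tracking that the list no longer returns: we missed their
    # DELETED event while disconnected.
    for name in set(tracked_pods) - set(listed):
        process_event({"type": "DELETED", "object": None, "name": name})

    # Resume watching from the resourceVersion of the list response.
    return pod_list.metadata.resource_version
```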
Force-pushed 526275b to dd283af
@ephraimbuddy Conflicts here.
Force-pushed dd283af to a292860
Resolved.
Force-pushed a292860 to 729efd4
Force-pushed a970a93 to 282979e
I came across using
Currently, when the Kubernetes watch stream times out and we get error 410, we just return resource version '0', which is not the latest version. From the documentation, timing out is expected and we should handle it by performing a list>watch>relist operation so we can continue watching from the latest resource version. See https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes. This PR follows the list>watch>relist pattern.
fixup! Handle kubernetes watcher stream disconnection
Apply review suggestions and add more tests
fixup! Apply review suggestions and add more tests
Handle APIException gracefully
Resolve conflicts
Force-pushed 282979e to e206161
@ephraimbuddy Can you remember why we closed this PR -- do we have an alternate solution?
I have forgotten why I closed this, but it looks related to the above.
@kaxil did you find an alternate solution? We are also facing stream disconnection.
Even with
So I think the solution is more in the way of #23521, where the last known resource_version is reset to 0 in the event of any exception.
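For context, a minimal sketch of that reset-on-exception approach, assuming a hypothetical `watch_once` wrapper (not Airflow's actual watcher code):

```python
from kubernetes.client.rest import ApiException


def run_watch_loop(watch_once, resource_version: str = "0") -> None:
    """Keep watching; on any API exception fall back to resource_version '0'."""
    while True:
        try:
            # `watch_once` consumes one watch stream and returns the last
            # resourceVersion it saw before the stream ended cleanly.
            resource_version = watch_once(resource_version)
        except ApiException:
            # Any API error (including 410 Gone): reset to "0" and let the
            # API server choose a starting point for the next watch.
            resource_version = "0"
```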
Currently, when the Kubernetes watch stream times out and we get error 410,
we just return resource version '0', which is not the latest version.
From the documentation, timing out is expected and we should handle it
by performing a list>watch>relist operation so we can continue watching
from the latest resource version. See https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes
This PR follows the list>watch>relist pattern.
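For reference, a minimal sketch of the list>watch>relist pattern using the kubernetes Python client; `process_event`, the namespace, and the timeout value are illustrative and not the exact code in this PR:

```python
from kubernetes import client, watch
from kubernetes.client.rest import ApiException


def run_pod_watcher(namespace, process_event):
    v1 = client.CoreV1Api()
    # 1. list: remember the resourceVersion of the list response.
    pod_list = v1.list_namespaced_pod(namespace=namespace)
    resource_version = pod_list.metadata.resource_version

    while True:
        w = watch.Watch()
        try:
            # 2. watch: start from the resourceVersion returned by the list.
            for event in w.stream(
                v1.list_namespaced_pod,
                namespace=namespace,
                resource_version=resource_version,
                timeout_seconds=300,
            ):
                process_event(event)
                # Remember the newest version seen so a clean timeout resumes
                # from where the stream stopped.
                resource_version = event["object"].metadata.resource_version
        except ApiException as e:
            if e.status != 410:
                raise
            # 3. relist: the stored version is too old (410 Gone), so list
            # again and continue from the fresh resourceVersion.
            pod_list = v1.list_namespaced_pod(namespace=namespace)
            resource_version = pod_list.metadata.resource_version
        finally:
            w.stop()
```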
Closes: #15418
This would likely fix some related issues: #14175, #13916, and #12644 (comment)