
Handle kubernetes watcher stream disconnection #15500

Closed
wants to merge 3 commits

Conversation

ephraimbuddy (Contributor)

Currently, when the Kubernetes watch stream times out and we get error 410,
we just return resource version '0', which is not the latest version.

From the documentation, timing out is expected and we should handle it
by performing a list>watch>relist operation so we can continue watching
from the latest resource version. See https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes

This PR follows the list>watch>relist pattern.

Closes: #15418
This would likely fix some of the issues in #14175, #13916, and #12644 (comment)
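A minimal sketch of the list>watch>relist loop described above, with the Kubernetes API calls abstracted as injected callables so only the control flow is shown (all names here are illustrative, not the actual executor code):

```python
class Gone410(Exception):
    """Stands in for kubernetes.client.rest.ApiException with status 410."""

def watch_with_relist(list_pods, watch_events, process_event, max_cycles=10):
    """Resume watching from the latest resourceVersion after a 410.

    list_pods() performs a fresh LIST and returns its resourceVersion;
    watch_events(rv) yields (event, new_rv) pairs and raises Gone410
    when the server reports rv as too old. A real watcher would loop
    forever; max_cycles just bounds this sketch.
    """
    resource_version = list_pods()  # initial LIST
    for _ in range(max_cycles):
        try:
            for event, resource_version in watch_events(resource_version):
                process_event(event)
        except Gone410:
            # 410 Gone: re-list to learn the latest resourceVersion
            # instead of falling back to "0"
            resource_version = list_pods()
    return resource_version
```

The point of the re-list on 410 is that the server only retains a bounded window of history, so resuming from the stale version is impossible and resuming from "0" gives no freshness guarantee.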



@boring-cyborg boring-cyborg bot added provider:cncf-kubernetes Kubernetes provider related issues area:Scheduler including HA (high availability) scheduler labels Apr 23, 2021
@github-actions

The Workflow run is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*.

@jedcunningham (Member) left a comment

A few minor things, but otherwise LGTM.

@ephraimbuddy ephraimbuddy force-pushed the relist-pod-on-error branch 2 times, most recently from 6710e52 to da5cc72 Compare April 27, 2021 07:56
@ephraimbuddy ephraimbuddy requested a review from ashb April 27, 2021 07:58
@github-actions

The Workflow run is cancelling this PR. Building images for the PR has failed. Follow the workflow link to check the reason.

@ashb (Member) left a comment

So this change introduces a problem.

If we are asking to watch at resource version A (which is too old), and we do a list and get back version A+5 (say) then we have "lost" the events A+1..A+4, i.e. we would never call process_event() for the events in this range.

So we need to think about whether that is a problem or not.
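The gap can be made concrete with a toy helper (purely illustrative, not executor code): if the watch expired at version A and the re-list lands at A+5, process_event() is never called for A+1..A+4.

```python
def lost_events(expired_watch_version, relist_version):
    """Versions emitted between the expired watch position and the
    re-list snapshot; the watcher never sees these as events."""
    return list(range(expired_watch_version + 1, relist_version))

# e.g. with A = 100 and a re-list at A + 5, events 101..104 are lost;
# only the cumulative state at 105 is visible via the LIST result.
```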

@jedcunningham (Member)

we do a list and get back version A+5 (say) then we have "lost" the events A+1..A+4, i.e. we would never call process_event() for the events in this range.

Unfortunately, right now we have that same problem, as we (attempt to) rewatch from "0", a.k.a. any point in history. Per the docs, the latest is preferred, so the behavior is likely pretty similar:
https://kubernetes.io/docs/reference/using-api/api-concepts/#the-resourceversion-parameter

I think the piece we are missing is a step to reconcile our state against what the list returns, e.g. detect we missed A+1..A+4 and action them. More work needs to be done to harden this area and this is the first step (and this alone doesn't buy us much other than setting us up for being more complete later).

@ephraimbuddy and I will spend some time trying to reproduce this scenario on demand to make sure we are starting down the right path.

@ephraimbuddy ephraimbuddy force-pushed the relist-pod-on-error branch 2 times, most recently from 526275b to dd283af Compare April 27, 2021 22:14
@kaxil (Member)

kaxil commented May 24, 2021

@ephraimbuddy Conflicts here

@ephraimbuddy ephraimbuddy force-pushed the relist-pod-on-error branch from dd283af to a292860 Compare May 25, 2021 06:48
@ephraimbuddy (Contributor, Author)

@ephraimbuddy Conflicts here

Resolved.

@ephraimbuddy ephraimbuddy force-pushed the relist-pod-on-error branch from a292860 to 729efd4 Compare May 25, 2021 22:04
@ephraimbuddy ephraimbuddy force-pushed the relist-pod-on-error branch 2 times, most recently from a970a93 to 282979e Compare July 19, 2021 08:33
@ephraimbuddy (Contributor, Author)

ephraimbuddy commented Jul 20, 2021

I came across allow_watch_bookmarks (https://kubernetes.io/docs/reference/using-api/api-concepts/#watch-bookmarks)
as a way to get the latest resource version, but the Python client has it disabled for now: https://github.com/kubernetes-client/python-base/pull/234/files.
That would have solved this for us.
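If bookmarks were available, the watcher could use BOOKMARK events purely to advance its last-seen resourceVersion without processing an object change. A rough sketch (hypothetical helper operating on plain-dict events rather than the real client's objects):

```python
def consume_watch(events, process_event):
    """Track resourceVersion from every event, but hand only real
    object changes to process_event; a BOOKMARK event carries no
    object change, just a version checkpoint the server chose to send.
    """
    resource_version = None
    for event in events:
        resource_version = event["object"]["metadata"]["resourceVersion"]
        if event["type"] == "BOOKMARK":
            continue  # checkpoint only: advance the version, process nothing
        process_event(event)
    return resource_version
```

This keeps the resumption point fresh even through quiet periods, which is exactly what the relist workaround has to approximate with an extra LIST call.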

Handle kubernetes watcher stream disconnection

fixup! Handle kubernetes watcher stream disconnection

Apply review suggestions and add more tests

fixup! Apply review suggestions and add more tests

Handle APIException gracefully

Resolve conflicts
@ephraimbuddy ephraimbuddy force-pushed the relist-pod-on-error branch from 282979e to e206161 Compare July 20, 2021 11:10
@ephraimbuddy ephraimbuddy deleted the relist-pod-on-error branch September 20, 2021 20:07
@kaxil (Member)

kaxil commented Nov 18, 2021

@ephraimbuddy Can you remember why we closed this PR -- do we have an alternate solution?

@ephraimbuddy (Contributor, Author)

@ephraimbuddy Can you remember why we closed this PR -- do we have an alternate solution?

I have forgotten why I closed this, but it looks related to the allow_watch_bookmarks finding mentioned above.

@JeremieDoctrine

@kaxil did you find an alternate solution? We are also facing stream disconnections.

@ecerulm (Contributor)

ecerulm commented May 6, 2022

I came across allow_watch_bookmarks (https://kubernetes.io/docs/reference/using-api/api-concepts/#watch-bookmarks)
as a way to get the latest resource version, but the Python client has it disabled for now: https://github.com/kubernetes-client/python-base/pull/234/files.
That would have solved this for us.

Even with allow_watch_bookmarks, there is no guarantee that the Kubernetes API will send those bookmark events at all, as stated in the link:

As a client, you can request BOOKMARK events by setting the allowWatchBookmarks=true query parameter to a watch request, but you shouldn't assume bookmarks are returned at any specific interval, nor can clients assume that the API server will send any BOOKMARK event even when requested.

So I think the solution is more along the lines of #23521, where the last known resource_version is reset to "0" in the event of any exception.
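That reset-on-error approach can be sketched as follows (illustrative names only; #23521 applies this inside the real watcher loop):

```python
def run_watcher(watch_once, process_event, max_restarts=5):
    """On any exception from the watch stream, discard the stale
    resourceVersion and restart from "0", letting the API server pick
    a recent consistent snapshot rather than failing on an expired
    version. A real watcher loops forever; max_restarts bounds this
    sketch. watch_once(rv) yields (event, new_rv) pairs.
    """
    resource_version = "0"
    for _ in range(max_restarts):
        try:
            for event, resource_version in watch_once(resource_version):
                process_event(event)
        except Exception:
            resource_version = "0"  # reset, per the approach in #23521
    return resource_version
```

The trade-off versus the relist pattern in this PR is simplicity: resetting to "0" never has to pay for an extra LIST, but it also gives up any attempt to pin down which events fell into the gap.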

Labels
area:Scheduler including HA (high availability) scheduler provider:cncf-kubernetes Kubernetes provider related issues
7 participants