
Handle kubernetes watcher stream disconnection #15500

Closed
wants to merge 3 commits

Conversation

ephraimbuddy (Contributor)

Currently, when the Kubernetes watch stream times out and we get error 410,
we just return resource version '0', which is not the latest version.

From the documentation, timing out is expected and we should handle it
by performing a list>watch>relist operation so we can continue watching
from the latest resource version. See https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes

This PR follows the list>watch>relist pattern.

Closes: #15418
This would likely fix some of the issues in #14175, #13916, and #12644 (comment)
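A minimal sketch of the list>watch>relist loop described above, with the Kubernetes API calls abstracted as injected callables so only the control flow is shown (all names here are illustrative, not the actual executor code):

```python
class Gone410(Exception):
    """Stands in for kubernetes.client.rest.ApiException with status 410."""

def watch_with_relist(list_pods, watch_events, process_event, max_cycles=10):
    """Resume watching from the latest resourceVersion after a 410.

    list_pods() performs a fresh LIST and returns its resourceVersion;
    watch_events(rv) yields (event, new_rv) pairs and raises Gone410
    when the server reports rv as too old. A real watcher would loop
    forever; max_cycles just bounds this sketch.
    """
    resource_version = list_pods()  # initial LIST
    for _ in range(max_cycles):
        try:
            for event, resource_version in watch_events(resource_version):
                process_event(event)
        except Gone410:
            # 410 Gone: re-list to learn the latest resourceVersion
            # instead of falling back to "0"
            resource_version = list_pods()
    return resource_version
```

The point of the re-list on 410 is that the server only retains a bounded window of history, so resuming from the stale version is impossible and resuming from "0" gives no freshness guarantee.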



@boring-cyborg boring-cyborg bot added provider:cncf-kubernetes Kubernetes provider related issues area:Scheduler including HA (high availability) scheduler labels Apr 23, 2021
@github-actions

The Workflow run is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*.

@jedcunningham (Member) left a comment

A few minor things, but otherwise LGTM.

@ephraimbuddy ephraimbuddy force-pushed the relist-pod-on-error branch 2 times, most recently from 6710e52 to da5cc72 Compare April 27, 2021 07:56
@ephraimbuddy ephraimbuddy requested a review from ashb April 27, 2021 07:58
@github-actions

The Workflow run is cancelling this PR. Building images for the PR has failed. Follow the workflow link to check the reason.

@ashb (Member) left a comment

So this change introduces a problem.

If we are asking to watch at resource version A (which is too old), and we do a list and get back version A+5 (say) then we have "lost" the events A+1..A+4, i.e. we would never call process_event() for the events in this range.

So we need to think about whether that is a problem or not.
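The gap can be made concrete with a toy helper (purely illustrative, not executor code): if the watch expired at version A and the re-list lands at A+5, process_event() is never called for A+1..A+4.

```python
def lost_events(expired_watch_version, relist_version):
    """Versions emitted between the expired watch position and the
    re-list snapshot; the watcher never sees these as events."""
    return list(range(expired_watch_version + 1, relist_version))

# e.g. with A = 100 and a re-list at A + 5, events 101..104 are lost;
# only the cumulative state at 105 is visible via the LIST result.
```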

@jedcunningham (Member)

we do a list and get back version A+5 (say) then we have "lost" the events A+1..A+4, i.e. we would never call process_event() for the events in this range.

Unfortunately, right now we have that same problem, as we (attempt to) rewatch from "0", a.k.a. any point in history. Per the docs, the latest is preferred, so the behavior is likely pretty similar:
https://kubernetes.io/docs/reference/using-api/api-concepts/#the-resourceversion-parameter

I think the piece we are missing is a step to reconcile our state against what the list returns, e.g. detect we missed A+1..A+4 and action them. More work needs to be done to harden this area and this is the first step (and this alone doesn't buy us much other than setting us up for being more complete later).

@ephraimbuddy and I will spend some time trying to reproduce this scenario on demand to make sure we are starting down the right path.

@ephraimbuddy ephraimbuddy force-pushed the relist-pod-on-error branch 2 times, most recently from 526275b to dd283af Compare April 27, 2021 22:14
@kaxil (Member)

kaxil commented May 24, 2021

@ephraimbuddy Conflicts here

@ephraimbuddy ephraimbuddy force-pushed the relist-pod-on-error branch from dd283af to a292860 Compare May 25, 2021 06:48
@ephraimbuddy (Contributor, Author)

@ephraimbuddy Conflicts here

Resolved.

@ephraimbuddy ephraimbuddy force-pushed the relist-pod-on-error branch from a292860 to 729efd4 Compare May 25, 2021 22:04
@ephraimbuddy ephraimbuddy force-pushed the relist-pod-on-error branch 2 times, most recently from a970a93 to 282979e Compare July 19, 2021 08:33
@ephraimbuddy (Contributor, Author)

ephraimbuddy commented Jul 20, 2021

I came across allow_watch_bookmarks (https://kubernetes.io/docs/reference/using-api/api-concepts/#watch-bookmarks)
as a way to get the latest resource version, but the Python client has it disabled for now: https://github.com/kubernetes-client/python-base/pull/234/files.
That would have solved this for us.
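If bookmarks were available, the watcher could use BOOKMARK events purely to advance its last-seen resourceVersion without processing an object change. A rough sketch (hypothetical helper operating on plain-dict events rather than the real client's objects):

```python
def consume_watch(events, process_event):
    """Track resourceVersion from every event, but hand only real
    object changes to process_event; a BOOKMARK event carries no
    object change, just a version checkpoint the server chose to send.
    """
    resource_version = None
    for event in events:
        resource_version = event["object"]["metadata"]["resourceVersion"]
        if event["type"] == "BOOKMARK":
            continue  # checkpoint only: advance the version, process nothing
        process_event(event)
    return resource_version
```

This keeps the resumption point fresh even through quiet periods, which is exactly what the relist workaround has to approximate with an extra LIST call.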

Handle kubernetes watcher stream disconnection

fixup! Handle kubernetes watcher stream disconnection

Apply review suggestions and add more tests

fixup! Apply review suggestions and add more tests

Handle APIException gracefully

Resolve conflicts
@ephraimbuddy ephraimbuddy force-pushed the relist-pod-on-error branch from 282979e to e206161 Compare July 20, 2021 11:10
@ephraimbuddy ephraimbuddy deleted the relist-pod-on-error branch September 20, 2021 20:07
@kaxil (Member)

kaxil commented Nov 18, 2021

@ephraimbuddy Can you remember why we closed this PR -- do we have an alternate solution?

@ephraimbuddy (Contributor, Author)

@ephraimbuddy Can you remember why we closed this PR -- do we have an alternate solution?

I have forgotten why I closed this, but it looks related to the allow_watch_bookmarks finding mentioned above.

@JeremieDoctrine

@kaxil did you find an alternate solution? We are also facing stream disconnections.

@ecerulm (Contributor)

ecerulm commented May 6, 2022

I came across allow_watch_bookmarks (https://kubernetes.io/docs/reference/using-api/api-concepts/#watch-bookmarks)
as a way to get the latest resource version, but the Python client has it disabled for now: https://github.com/kubernetes-client/python-base/pull/234/files.
That would have solved this for us.

Even with allow_watch_bookmarks, there is no guarantee that the Kubernetes API will send those bookmark events at all, as stated in the link:

As a client, you can request BOOKMARK events by setting the allowWatchBookmarks=true query parameter to a watch request, but you shouldn't assume bookmarks are returned at any specific interval, nor can clients assume that the API server will send any BOOKMARK event even when requested.

So I think the solution is more along the lines of #23521, where the last known resource_version is reset to "0" in the event of any exception.
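That reset-on-error approach can be sketched as follows (illustrative names only; #23521 applies this inside the real watcher loop):

```python
def run_watcher(watch_once, process_event, max_restarts=5):
    """On any exception from the watch stream, discard the stale
    resourceVersion and restart from "0", letting the API server pick
    a recent consistent snapshot rather than failing on an expired
    version. A real watcher loops forever; max_restarts bounds this
    sketch. watch_once(rv) yields (event, new_rv) pairs.
    """
    resource_version = "0"
    for _ in range(max_restarts):
        try:
            for event, resource_version in watch_once(resource_version):
                process_event(event)
        except Exception:
            resource_version = "0"  # reset, per the approach in #23521
    return resource_version
```

The trade-off versus the relist pattern in this PR is simplicity: resetting to "0" never has to pay for an extra LIST, but it also gives up any attempt to pin down which events fell into the gap.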

Labels
area:Scheduler including HA (high availability) scheduler provider:cncf-kubernetes Kubernetes provider related issues
7 participants