[AIRFLOW-6040] Fix KubernetesJobWatcher Read time out error #6643
Conversation
This LGTM thank you for catching this! Please fix the flake8 issues and once tests pass I'll gladly merge :)
Codecov Report

```
@@          Coverage Diff           @@
##           master    #6643  +/- ##
=========================================
  Coverage        ?   84.31%
=========================================
  Files           ?      676
  Lines           ?    38353
  Branches        ?        0
=========================================
  Hits            ?    32338
  Misses          ?     6015
  Partials        ?        0
=========================================
```

Continue to review full report at Codecov.
Perhaps setting this as a configuration value, and falling back to a constant in case it doesn't exist?
```diff
 watcher = watch.Watch()

-kwargs = {'label_selector': 'airflow-worker={}'.format(worker_uuid)}
+kwargs = {'label_selector': 'airflow-worker={}'.format(worker_uuid),
+          'timeout_seconds': 50}
```
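Picking up the suggestion above to make this a configuration value with a constant fallback, here is a minimal sketch. The option name `watcher_timeout_seconds` and the plain-dict config are hypothetical, purely for illustration; they are not Airflow's actual config API.

```python
# Hypothetical fallback constant, mirroring the hard-coded 50 in the diff above.
DEFAULT_WATCHER_TIMEOUT_SECONDS = 50

def get_watcher_timeout(config: dict) -> int:
    """Return the watch timeout from a parsed config dict, or the default
    when the option is missing or not a valid integer."""
    try:
        return int(config.get('watcher_timeout_seconds',
                              DEFAULT_WATCHER_TIMEOUT_SECONDS))
    except (TypeError, ValueError):
        return DEFAULT_WATCHER_TIMEOUT_SECONDS

# The hard-coded kwarg from the diff then becomes config-driven:
kwargs = {'label_selector': 'airflow-worker={}'.format('worker-uuid-123'),
          'timeout_seconds': get_watcher_timeout({})}
```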
Yeah, this should be a config variable at the least.
Also it would be good if
airflow/airflow/config_templates/default_airflow.cfg
Lines 782 to 787 in df35957
```
# **kwargs parameters to pass while calling a kubernetes client core_v1_api methods from Kubernetes Executor
# provided as a single line formatted JSON dictionary string.
# List of supported params in **kwargs are similar for all core_v1_apis, hence a single config variable for all apis
# See:
# https://raw.githubusercontent.com/kubernetes-client/python/master/kubernetes/client/apis/core_v1_api.py
kube_client_request_args =
```
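For illustration, since the option above is documented to take a single-line JSON dictionary string, a value such as the following (the numbers are only an example) would be parsed and passed through as `**kwargs`:

```
kube_client_request_args = {"_request_timeout": [60, 60]}
```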
Wait, we already use the config option two lines down.
Someone on slack mentioned that this won't work but I don't see why from the code.
If `kube_client_request_args` is used, the Kubernetes executor fails to kick off tasks and the scheduler throws this exception:
```
[2019-11-25 18:02:53,397] {scheduler_job.py:1352} ERROR - Exception when executing execute_helper
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/airflow/jobs/scheduler_job.py", line 1350, in _execute
    self._execute_helper()
  File "/usr/local/lib/python3.7/site-packages/airflow/jobs/scheduler_job.py", line 1439, in _execute_helper
    self.executor.heartbeat()
  File "/usr/local/lib/python3.7/site-packages/airflow/executors/base_executor.py", line 136, in heartbeat
    self.sync()
  File "/usr/local/lib/python3.7/site-packages/airflow/contrib/executors/kubernetes_executor.py", line 801, in sync
    self.kube_scheduler.run_next(task)
  File "/usr/local/lib/python3.7/site-packages/airflow/contrib/executors/kubernetes_executor.py", line 456, in run_next
    self.launcher.run_pod_async(pod, **self.kube_config.kube_client_request_args)
  File "/usr/local/lib/python3.7/site-packages/airflow/contrib/kubernetes/pod_launcher.py", line 62, in run_pod_async
    resp = self._client.create_namespaced_pod(body=req, namespace=pod.namespace, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/apis/core_v1_api.py", line 6115, in create_namespaced_pod
    (data) = self.create_namespaced_pod_with_http_info(namespace, body, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/apis/core_v1_api.py", line 6148, in create_namespaced_pod_with_http_info
    " to method create_namespaced_pod" % key
TypeError: Got an unexpected keyword argument 'timeout_seconds' to method create_namespaced_pod
```
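The generated client methods validate their keyword arguments against a per-method whitelist, which is why the watcher-only `timeout_seconds` kwarg is rejected by `create_namespaced_pod`. A toy stand-in (not the real client; the whitelist below is an illustrative subset) reproduces the failure mode from the traceback:

```python
def create_namespaced_pod_stub(namespace, body, **kwargs):
    """Toy stand-in for a generated kubernetes-client method: such methods
    accept only a whitelist of kwargs and raise TypeError for anything else."""
    allowed = {'pretty', '_request_timeout'}  # illustrative subset
    for key in kwargs:
        if key not in allowed:
            raise TypeError(
                "Got an unexpected keyword argument '%s'"
                " to method create_namespaced_pod" % key
            )
    return {'namespace': namespace}

# The watch-only kwarg reproduces the error above:
try:
    create_namespaced_pod_stub('default', body={}, timeout_seconds=50)
    failed = False
except TypeError:
    failed = True

# ...while _request_timeout is accepted by every generated method:
ok = create_namespaced_pod_stub('default', body={}, _request_timeout=(60, 60))
```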
I believe the correct argument name here is `_request_timeout`. I can't link the generated Python API file as it is too large for GitHub, but it's on line 6141 of https://github.com/kubernetes-client/python/blob/master/kubernetes/client/api/core_v1_api.py.
This doc link for `kube_client_request_args` is dead. It also states:

> List of supported params in **kwargs are similar for all core_v1_apis, hence a single config variable for all apis

I feel like this is the wrong approach, as these settings should be configurable on a per-request basis, but that's another matter and much more complex. For example, this `label_selector` argument would fail if passed to the `create_namespaced_pod` function.
If we didn't want to hard-code this value here, the other approaches I see are:
P.S.: The worker UUID seems to not persist and is created at runtime. If I follow correctly, this gets generated each time the scheduler runs. How is this tracked across restarts?
There's an `airflow.models.kubernetes.KubeWorkerIdentifier` table with a singleton row where it should be stored.
@maxirus I would favour either 1. or 4. for simplicity.
Leaving it hard-coded is going to break it if someone changes the default `_request_timeout` from 60s.
@ashb is this
Same! I'm using this environment variable as a work-around for now:
I can implement it like this, but do we really want the timeout to always be
🤦♂ No, not at all. Broken logic.
If possible it would be nice to add some unit tests too.
```python
for key, value in kube_config.kube_client_request_args.items():  # (.iteritems() in the original diff is Python 2; .items() under Python 3)
    kwargs[key] = value
conn_timeout = kube_config.kube_client_request_args.get('_request_timeout', [60, 60])[0]
kwargs['timeout_seconds'] = conn_timeout - 1 if conn_timeout - 1 > 0 else 1
```
```diff
-kwargs['timeout_seconds'] = conn_timeout - 1 if conn_timeout - 1 > 0 else 1
+kwargs['timeout_seconds'] = max(conn_timeout - 1, 1)
```
(Assuming they are integers.)
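For integer inputs the suggested `max()` form is indeed equivalent to the conditional; a quick check over a range of timeouts, including the edge cases around zero:

```python
def clamp_original(conn_timeout):
    # conditional form from the PR diff above
    return conn_timeout - 1 if conn_timeout - 1 > 0 else 1

def clamp_suggested(conn_timeout):
    # reviewer's suggested equivalent
    return max(conn_timeout - 1, 1)

# Both clamp the derived watch timeout to at least 1 second.
results = [(t, clamp_original(t), clamp_suggested(t)) for t in range(0, 121)]
```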
Will try to get to this later this week.
@maxirus Any progress?
@mbelang No. Between the holidays and work I have not had time. Hoping to have some time this weekend.
This mitigated the problem at least :)
What is the default timeout currently?
Syncing upstream
Sooo the test framework has a high barrier to entry and there doesn't seem to be any existing tests for the
If `_request_timeout` is neither an int nor a 2-tuple, it is swallowed without further notice, which is rather unfortunate because the level at which developers would have to look for this issue is pretty deep. This has already led to confusion, see apache/airflow#6643 (comment). While it would break backwards compatibility to raise an exception, we should at least warn the developer.
@mbelang Unfortunately, your mitigation disables the timeout completely, see kubernetes-client/python#1069. Passing a string there makes the timeout disappear and lets the scheduler process wait forever. I have not checked deeply enough to find out whether that is a real problem or not, though.
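Following the suggestion above to at least warn the developer, a minimal validation sketch. The helper name is hypothetical (this is not Airflow or kubernetes-client code); it accepts a number or a 2-tuple/list of numbers and warns on anything else, such as the string `"50"` that silently disables the timeout:

```python
import warnings

def validate_request_timeout(value):
    """Return a usable _request_timeout (number, or (connect, read) tuple),
    or warn and return None for unsupported types instead of swallowing
    them silently."""
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return value
    if (isinstance(value, (tuple, list)) and len(value) == 2
            and all(isinstance(v, (int, float)) for v in value)):
        return tuple(value)
    warnings.warn(
        "_request_timeout=%r is neither a number nor a 2-tuple; "
        "it will be ignored and the request will have no timeout" % (value,)
    )
    return None
```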
The kube tests in particular are the hardest to test, yes :(
Was this resolved? Setting `AIRFLOW__KUBERNETES__KUBE_CLIENT_REQUEST_ARGS: '{ "_request_timeout": "50" }'` did not resolve our issue with KubernetesJobWatcher.
Same here.
Hi, has there been any follow-up on this? This is really a blocker, as it seems like the KubernetesExecutor is completely broken, with no workaround.
I just tested the fix presented here and I confirm it does work. Was this PR closed only because of the lack of a unit test?
I found this guide very useful for those setting up Airflow with the Kubernetes executor for the first time: https://github.com/stwind/airflow-on-kubernetes
@pvcnt Please check out my last comment: the fix does not work as intended. The Kubernetes job watcher is a process that simply loops over the Kubernetes API, waiting for changes to pods. That's all it does. Whether it crashes with a timeout or gets restarted virtually does not matter. What you are seeing is a false alarm. The fix should rather be that the timeout is gracefully handled. I will provide a patch when I find some time for doing it. Again: the issue here does not impair the operations of the KubernetesExecutor.
@sbrandtb From what I understood, it is exactly the purpose to set a server-side timeout and handle it gracefully, instead of relying on a client-side timeout that triggers an exception, isn't it? What other approach do you propose? I had several issues going on at the same time in my cluster, so maybe what I was observing was caused by another issue (solved since). But still, in the current state there is log pollution that makes it much more difficult to identify a real problem.
@sbrandtb You shouldn't catch (and subsequently ignore) a connection/response timeout error. Setting
@ashb @dimberman I would suggest taking another look at PR #7616
@maxirus Sorry, my bad. I did not see, in fact, that you were setting
However, I still disagree with you setting the
Either:
Because, if the request timeout from settings is something other than
But in general I agree that setting the
@sbrandtb I think you should take another look at the PR and read the comments in this thread again. My PR doesn't change the
Where am I setting this?
Nope... It's been configurable for a number of releases now and I didn't set this default value.
Yep.
Again, read the comments please. That is not how the maintainers wanted to handle it (see here)
Where's the double default?
I want to know how to fix it right now. From all the above viewpoints, we know that we need to pass a

```python
def _run(self, kube_client, resource_version, worker_uuid, kube_config):
    self.log.info(
        'Event: and now my watch begins starting at resource_version: %s',
        resource_version
    )
    watcher = watch.Watch()
    kwargs = {'label_selector': 'airflow-worker={}'.format(worker_uuid)}
    if resource_version:
        kwargs['resource_version'] = resource_version
    if kube_config.kube_client_request_args:
        for key, value in kube_config.kube_client_request_args.items():
            kwargs[key] = value
    last_resource_version = None
    for event in watcher.stream(kube_client.list_namespaced_pod, self.namespace,
                                **kwargs):
        task = event['object']
        self.log.info(
            'Event: %s had an event of type %s',
            task.metadata.name, event['type']
        )
        if event['type'] == 'ERROR':
            return self.process_error(event)
        self.process_status(
            task.metadata.name, task.status.phase, task.metadata.labels,
            task.metadata.resource_version
        )
        last_resource_version = task.metadata.resource_version
    return last_resource_version
```

I guess the change belongs in this method. However, what I wonder is whether this problem arises from apache-airflow
Would it be possible to re-open this PR and consider applying this fix (or a similar one)? This issue is still present in the latest release of Airflow (scheduler logs are polluted with ReadTimeoutError), and setting `timeout_seconds` is a fix that works.
Jira
Description
will cause a warning instead of an exception when a `worker_uuid` does not exist. `timeout_seconds` targets the `list_namespaced_pod` method, as opposed to the underlying urllib3 library, which throws an exception.
Tests
Commits
Documentation