Add retry mechanism to handle intermittent connection issues with Kubernetes logging stream #683
Conversation
Hi,
@egmar Yes, we're trying to figure that out. Hopefully along with ansible/awx#11338.
Force-pushed from 3ce8490 to 7bfc749.
I can confirm that the testing job kept running the whole night.

```yaml
---
- hosts: "{{ target|default('localhost') }}"
  become: false
  gather_facts: false
  tasks:
    - name: Run Job
      ansible.builtin.shell: 'while true ; do date ; sleep 3600; done'
      register: long_output_job
      ignore_errors: true
```
Good job!
@Vitexus You are awesome, thanks for testing. We iterated a bit more. Can you try with the latest code here?
I found another bug where it breaks when very long lines are emitted. Will try to figure out a solution later today.
Well... it's not a bug in this code 😰 I have a pod that starts and runs this:
When running it, for whatever reason, there are only 16385 bytes printed, followed by a timestamp, then more bytes. Since AWX expects each line to be either intact JSON or a full base64-encoded blob of zip data, this will definitely break certain workloads.
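To make the failure mode concrete, here is a small self-contained sketch (not from this PR; the exact split offset and the injected timestamp are placeholders based on the observation above) showing why a long line that gets broken up with a timestamp in the middle stops being decodable as a single base64 blob:

```go
package main

import (
	"encoding/base64"
	"fmt"
	"strings"
)

func main() {
	// Encode a long payload as a single base64 line, then simulate the log
	// stream splitting it after roughly 16K bytes and injecting a timestamp
	// before the remainder (placeholder values, not measured behavior).
	payload := base64.StdEncoding.EncodeToString(make([]byte, 20000))
	broken := payload[:16384] + "\n2022-11-01T00:00:00.000000000Z " + payload[16384:]

	for i, line := range strings.Split(broken, "\n") {
		// Neither fragment is the complete blob anymore, which is what
		// breaks consumers that decode line by line.
		if _, err := base64.StdEncoding.DecodeString(line); err != nil {
			fmt.Printf("line %d: decode error: %v\n", i, err)
		} else {
			fmt.Printf("line %d: decodes, but only to part of the payload\n", i)
		}
	}
}
```

The first fragment still decodes but only yields part of the payload, and the second fails outright because the injected timestamp is not valid base64.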
Maybe related: kubernetes/kubernetes#77822
@Vitexus btw love your github bio
The problem seems to be even worse in OpenShift. There are only 4097 bytes between the timestamps...
Ok, next try with latest changes tomorrow...
The latest idea was to just use local time to resume the log, but with that implementation, if the last few lines of job output get repeated, it causes a problem. Gonna pivot to #685 to leverage socat and port forwarding to stream the log back via TCP.
Hao & Seth, can you please update this and get it tested against kubernetes/kubernetes#113481? Thanks.
Good news: we were able to convince the Kubernetes maintainers that the log issue @shanemcd discovered is indeed a bug 🥳 Details of the fix are in kubernetes/kubernetes#113481. With that, we can proceed with fleshing this PR out. Now we should be able to use the exact log timestamp to prevent duplication of log messages and resume from the specific message at the time of disconnection.
@Vitexus thanks for caring about this topic and helping us out with this issue, we really appreciate it. Have you heard about our community Matrix chatroom, https://matrix.to/#/#awx:ansible.com? It's a great place for collaborating with the AWX community, join us!
pkg/workceptor/kubernetes.go
```go
if err == nil {
	break
} else {
	time.Sleep(100 * time.Millisecond)
}
```
On my machine, this wasn't quite long enough to recover gracefully when the kubelet is restarted. Changing this to 1 second made it work. I see where we changed 4 other places from 5 seconds to 100 milliseconds. Let's go with somewhere in the middle... maybe 1 or 2 seconds.
logger.warn here too.
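Putting both review suggestions together, a minimal self-contained sketch of the retry shape being discussed might look like the following (openLogStream, the retry count, and the use of the standard library logger are placeholders, not the PR's actual code):

```go
package main

import (
	"errors"
	"log"
	"time"
)

// openLogStream is a hypothetical stand-in for the call in
// pkg/workceptor/kubernetes.go that (re)opens the pod log stream.
func openLogStream() error { return errors.New("kubelet not ready") }

func main() {
	const (
		retryCount = 5               // placeholder; not the PR's actual value
		retryDelay = 2 * time.Second // reviewers suggested 1-2s instead of 100ms
	)
	var err error
	for attempt := 1; attempt <= retryCount; attempt++ {
		if err = openLogStream(); err == nil {
			break
		}
		// Warn on each failed attempt (the "logger.warn here too" suggestion),
		// then give a restarted kubelet time to come back before retrying.
		log.Printf("warning: could not open log stream (attempt %d/%d): %v", attempt, retryCount, err)
		time.Sleep(retryDelay)
	}
	if err != nil {
		log.Fatalf("giving up after %d attempts: %v", retryCount, err)
	}
}
```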
Got a test image up.
kubernetes/kubernetes#113481 merged. In our logic we do need to deal with and preserve the old behavior (where the job fails after 4 hours) if Kubernetes does not contain the logging fix. So if anyone has a suggestion for how we can handle that gracefully, please comment. We might need an external param to detect whether the deployment is on a Kubernetes cluster with the patch and enable/disable timestamps accordingly.
I'd use https://pkg.go.dev/k8s.io/client-go/discovery#DiscoveryClient.ServerVersion and fall back to the old behaviour on Kubernetes without the fix.
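A rough sketch of that fallback, assuming client-go with an in-cluster config (the helper name and the comparison against the backport releases listed in the PR description are illustrative, not the PR's implementation):

```go
package main

import (
	"fmt"
	"log"

	"k8s.io/apimachinery/pkg/util/version"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/rest"
)

// Minimum patch releases carrying the log fix, per the PR description.
// Treating anything at or above the same minor's patch as "fixed" is an
// assumption in this sketch.
var fixedVersions = []*version.Version{
	version.MustParseGeneric("1.23.14"),
	version.MustParseGeneric("1.24.8"),
	version.MustParseGeneric("1.25.4"),
}

func kubeHasLogFix() (bool, error) {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return false, err
	}
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		return false, err
	}
	info, err := dc.ServerVersion()
	if err != nil {
		return false, err
	}
	v, err := version.ParseGeneric(info.GitVersion)
	if err != nil {
		return false, err
	}
	for _, fixed := range fixedVersions {
		// Same minor line: require at least the backported patch release.
		if v.Major() == fixed.Major() && v.Minor() == fixed.Minor() {
			return v.AtLeast(fixed), nil
		}
	}
	// Newer minors ship the fix; older minors never received the backport.
	return v.AtLeast(version.MustParseGeneric("1.26.0")), nil
}

func main() {
	ok, err := kubeHasLogFix()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("timestamp-based reconnect supported:", ok)
}
```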
Hello, |
Could you show your custom execution environment build? I tried using the quay.io/ansible/awx-ee:latest execution environment and it still timed out after 4 hours.
I never experienced the 4h timeout (because I don't have such long jobs), but I was experiencing the issue with the log rotation a lot. The patch was added in version 1.3.0.
As seen in ansible/awx#11338 and ansible/receptor#446:
- Force `RECEPTOR_KUBE_SUPPORT_RECONNECT` as per ansible/receptor#683
- Pump up timeouts thereof
Due to an issue in Kubernetes, AWX can't currently run jobs longer than 4 hours when deployed on Kubernetes. More context on that is in ansible/awx#11805.
This PR adds logic that will pick back up from the last line we saw, using the Kubernetes log timestamps.
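As a rough illustration of that idea, here is a self-contained client-go sketch, not the PR's actual code, that tails a pod's logs with timestamps enabled and reopens the stream from the last timestamp it saw (the namespace, pod name, and reconnect policy are placeholders):

```go
package main

import (
	"bufio"
	"context"
	"fmt"
	"log"
	"strings"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// followLogs tails a pod's logs and, when the stream drops, reopens it from
// the last timestamp seen so output is not lost after a disconnection.
func followLogs(ctx context.Context, cs *kubernetes.Clientset, ns, pod string) error {
	var since *metav1.Time
	for {
		opts := &corev1.PodLogOptions{
			Follow:     true,
			Timestamps: true, // needs the kubelet fix from kubernetes/kubernetes#113481
			SinceTime:  since,
		}
		stream, err := cs.CoreV1().Pods(ns).GetLogs(pod, opts).Stream(ctx)
		if err != nil {
			return err
		}
		scanner := bufio.NewScanner(stream)
		for scanner.Scan() {
			ts, msg, ok := strings.Cut(scanner.Text(), " ")
			if !ok {
				continue
			}
			if t, err := time.Parse(time.RFC3339Nano, ts); err == nil {
				// Remember the last timestamp so a reconnect resumes near here
				// (the real PR also has to drop duplicated lines at the boundary).
				since = &metav1.Time{Time: t}
			}
			fmt.Println(msg)
		}
		stream.Close()
		if ctx.Err() != nil {
			return ctx.Err()
		}
		// A real implementation would also check whether the pod has finished.
		log.Println("log stream closed, reconnecting from", since)
	}
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	if err := followLogs(context.Background(), cs, "awx", "my-job-pod"); err != nil {
		log.Fatal(err)
	}
}
```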
Requires a fix in Kubernetes for "Pod logs: long lines are corrupted when using timestamps=true",
fixed in kubernetes/kubernetes#113481.
Fix backported into Kubernetes release branches in the following PRs:
release-1.23 (1.23.14)
release-1.24 (1.24.8)
release-1.25 (1.25.4)
Fixes ported to OpenShift in the following PRs:
release-4.9 (4.9.x) (not yet merged)
release-4.10 (4.10.42)
release-4.11 (4.11.16)
release-4.12 (4.12.0)
The fix in this PR should detect the Kubernetes version and use `--timestamps` accordingly. However, due to the "wild wild west" nature of the Kubernetes world, we added the `RECEPTOR_KUBE_SUPPORT_RECONNECT` environment variable to force enabling/disabling the fix. `RECEPTOR_KUBE_SUPPORT_RECONNECT` has the options described in the notes below, and the flag can be set via the AWX custom resource.
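As a hedged sketch of how such a flag might gate the behavior (the "enabled"/"disabled" values come from the notes below; treating any other value as "auto-detect via the API server version" is an assumption here, not necessarily what the PR implements):

```go
package main

import (
	"fmt"
	"os"
)

// shouldUseTimestamps decides whether the timestamp-based reconnect logic is used.
// kubeHasLogFix is a placeholder for a version-detection helper.
func shouldUseTimestamps(kubeHasLogFix func() bool) bool {
	switch os.Getenv("RECEPTOR_KUBE_SUPPORT_RECONNECT") {
	case "enabled":
		return true // force on, skipping the version check entirely
	case "disabled":
		return false // force off, preserving the old behavior
	default:
		return kubeHasLogFix() // assumed fallback: detect via the API server version
	}
}

func main() {
	use := shouldUseTimestamps(func() bool { return true })
	fmt.Println("use timestamps:", use)
}
```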
NOTE: `RECEPTOR_KUBE_SUPPORT_RECONNECT` will bypass the version check for ALL container groups when set to "enabled" or "disabled". If a specific container group does not have the right Kubernetes version and `RECEPTOR_KUBE_SUPPORT_RECONNECT` is set to "enabled", job execution with that container group will fail due to a corrupted log stream.
NOTE: It is also possible for the kubelet version to differ from the kube-apiserver version. Detecting the presence of the fix using the kube-apiserver version is not the safest option, but it's the only option we have here.