Add retry mechanism to handle intermittent connection issues with Kubernetes logging stream #683
Conversation
Hi,
@egmar Yes, we're trying to figure that out. Hopefully along with ansible/awx#11338.
Force-pushed from 3ce8490 to 7bfc749.
I can confirm that the testing job kept running the whole night.

```yaml
---
- hosts: "{{ target|default('localhost') }}"
  become: false
  gather_facts: false
  tasks:
    - name: Run Job
      ansible.builtin.shell: 'while true ; do date ; sleep 3600; done'
      register: long_output_job
      ignore_errors: true
```
Good job!
@Vitexus You are awesome, thanks for testing. We iterated a bit more. Can you try with the latest code here?
I found another bug where it breaks when very long lines are emitted. Will try to figure out a solution later today.
Well... it's not a bug in this code 😰 I have a pod that starts and runs this:
When running it, for whatever reason, there are only 16385 bytes printed, followed by a timestamp, then more bytes. Since AWX expects each line to be either intact JSON or a full base64-encoded blob of zip data, this will definitely break certain workloads.
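To make the failure mode concrete, here is a small self-contained sketch (not from this PR; the exact split offset and the injected timestamp are placeholders based on the observation above) showing why a long line that gets broken up with a timestamp in the middle stops being decodable as a single base64 blob:

```go
package main

import (
	"encoding/base64"
	"fmt"
	"strings"
)

func main() {
	// Encode a long payload as a single base64 line, then simulate the log
	// stream splitting it after roughly 16K bytes and injecting a timestamp
	// before the remainder (placeholder values, not measured behavior).
	payload := base64.StdEncoding.EncodeToString(make([]byte, 20000))
	broken := payload[:16384] + "\n2022-11-01T00:00:00.000000000Z " + payload[16384:]

	for i, line := range strings.Split(broken, "\n") {
		// Neither fragment is the complete blob anymore, which is what
		// breaks consumers that decode line by line.
		if _, err := base64.StdEncoding.DecodeString(line); err != nil {
			fmt.Printf("line %d: decode error: %v\n", i, err)
		} else {
			fmt.Printf("line %d: decodes, but only to part of the payload\n", i)
		}
	}
}
```

The first fragment still decodes but only yields part of the payload, and the second fails outright because the injected timestamp is not valid base64.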
Maybe related: kubernetes/kubernetes#77822
@Vitexus btw love your github bio
The problem seems to be even worse in OpenShift. There are only 4097 bytes between the timestamps...
Ok, next try with latest changes tomorrow...
The latest idea was to just use local time to resume the log, but with that implementation, if the last few lines of job output get repeated, it causes a problem. Gonna pivot to #685 to leverage socat and port forwarding to stream the log back via TCP.
Hao & Seth, can you please update this and get it tested against kubernetes/kubernetes#113481? Thanks.
Good news: we were able to convince the Kubernetes maintainers that the log issue @shanemcd discovered is indeed a bug 🥳 Details of the fix are in kubernetes/kubernetes#113481. With that, we can proceed with fleshing this PR out. Now we should be able to use the exact log timestamp to prevent duplication of log messages and resume from the specific message at the time of disconnection.
@Vitexus thanks for caring about this topic and helping us out with this issue, we really appreciate it. Have you heard about our community Matrix chatroom, https://matrix.to/#/#awx:ansible.com? It's a great place for collaborating with the AWX community, join us!
pkg/workceptor/kubernetes.go
```go
if err == nil {
	break
} else {
	time.Sleep(100 * time.Millisecond)
}
```
On my machine, this wasn't quite long enough to recover gracefully when the kubelet is restarted. Changing this to 1 second made it work. I see where we changed 4 other places from 5 seconds to 100 milliseconds. Let's go with somewhere in the middle... maybe 1 or 2 seconds.
logger.warn here too.
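Putting both review suggestions together, a minimal self-contained sketch of the retry shape being discussed might look like the following (openLogStream, the retry count, and the use of the standard library logger are placeholders, not the PR's actual code):

```go
package main

import (
	"errors"
	"log"
	"time"
)

// openLogStream is a hypothetical stand-in for the call in
// pkg/workceptor/kubernetes.go that (re)opens the pod log stream.
func openLogStream() error { return errors.New("kubelet not ready") }

func main() {
	const (
		retryCount = 5               // placeholder; not the PR's actual value
		retryDelay = 2 * time.Second // reviewers suggested 1-2s instead of 100ms
	)
	var err error
	for attempt := 1; attempt <= retryCount; attempt++ {
		if err = openLogStream(); err == nil {
			break
		}
		// Warn on each failed attempt (the "logger.warn here too" suggestion),
		// then give a restarted kubelet time to come back before retrying.
		log.Printf("warning: could not open log stream (attempt %d/%d): %v", attempt, retryCount, err)
		time.Sleep(retryDelay)
	}
	if err != nil {
		log.Fatalf("giving up after %d attempts: %v", retryCount, err)
	}
}
```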
Got a test image up.
kubernetes/kubernetes#113481 merged. In our logic we do need to deal with and preserve the old behavior (where the job fails after 4 hours) if Kubernetes does not contain the logging fix. So if anyone has a suggestion for how we can handle that gracefully, please comment. We might need an external param to detect whether the deployment is on a Kubernetes cluster with the patch and enable/disable timestamps accordingly.
I'd use https://pkg.go.dev/k8s.io/client-go/discovery#DiscoveryClient.ServerVersion and fall back to the old behaviour on Kubernetes without the fix.
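A rough sketch of that fallback, assuming client-go with an in-cluster config (the helper name and the comparison against the backport releases listed in the PR description are illustrative, not the PR's implementation):

```go
package main

import (
	"fmt"
	"log"

	"k8s.io/apimachinery/pkg/util/version"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/rest"
)

// Minimum patch releases carrying the log fix, per the PR description.
// Treating anything at or above the same minor's patch as "fixed" is an
// assumption in this sketch.
var fixedVersions = []*version.Version{
	version.MustParseGeneric("1.23.14"),
	version.MustParseGeneric("1.24.8"),
	version.MustParseGeneric("1.25.4"),
}

func kubeHasLogFix() (bool, error) {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return false, err
	}
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		return false, err
	}
	info, err := dc.ServerVersion()
	if err != nil {
		return false, err
	}
	v, err := version.ParseGeneric(info.GitVersion)
	if err != nil {
		return false, err
	}
	for _, fixed := range fixedVersions {
		// Same minor line: require at least the backported patch release.
		if v.Major() == fixed.Major() && v.Minor() == fixed.Minor() {
			return v.AtLeast(fixed), nil
		}
	}
	// Newer minors ship the fix; older minors never received the backport.
	return v.AtLeast(version.MustParseGeneric("1.26.0")), nil
}

func main() {
	ok, err := kubeHasLogFix()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("timestamp-based reconnect supported:", ok)
}
```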
Hello, |
Could you show your custom execution environment build? I tried using the quay.io/ansible/awx-ee:latest execution environment and it still timed out after 4 hours.
I never experienced the 4h timeout (because I don't have such long jobs), but I was experiencing the issue with the log rotation a lot. The patch was added in version 1.3.0.
As seen in ansible/awx#11338 and ansible/receptor#446:
- Force `RECEPTOR_KUBE_SUPPORT_RECONNECT` as per ansible/receptor#683
- Pump up timeouts thereof
Due to an issue in Kubernetes, AWX can't currently run jobs longer than 4 hours when deployed on Kubernetes. More context on that is in ansible/awx#11805.
This PR adds logic that will pick back up from the last line we saw, using the Kubernetes log timestamps.
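As a rough illustration of that idea, here is a self-contained client-go sketch, not the PR's actual code, that tails a pod's logs with timestamps enabled and reopens the stream from the last timestamp it saw (the namespace, pod name, and reconnect policy are placeholders):

```go
package main

import (
	"bufio"
	"context"
	"fmt"
	"log"
	"strings"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// followLogs tails a pod's logs and, when the stream drops, reopens it from
// the last timestamp seen so output is not lost after a disconnection.
func followLogs(ctx context.Context, cs *kubernetes.Clientset, ns, pod string) error {
	var since *metav1.Time
	for {
		opts := &corev1.PodLogOptions{
			Follow:     true,
			Timestamps: true, // needs the kubelet fix from kubernetes/kubernetes#113481
			SinceTime:  since,
		}
		stream, err := cs.CoreV1().Pods(ns).GetLogs(pod, opts).Stream(ctx)
		if err != nil {
			return err
		}
		scanner := bufio.NewScanner(stream)
		for scanner.Scan() {
			ts, msg, ok := strings.Cut(scanner.Text(), " ")
			if !ok {
				continue
			}
			if t, err := time.Parse(time.RFC3339Nano, ts); err == nil {
				// Remember the last timestamp so a reconnect resumes near here
				// (the real PR also has to drop duplicated lines at the boundary).
				since = &metav1.Time{Time: t}
			}
			fmt.Println(msg)
		}
		stream.Close()
		if ctx.Err() != nil {
			return ctx.Err()
		}
		// A real implementation would also check whether the pod has finished.
		log.Println("log stream closed, reconnecting from", since)
	}
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	if err := followLogs(context.Background(), cs, "awx", "my-job-pod"); err != nil {
		log.Fatal(err)
	}
}
```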
Requires a fix in Kubernetes for "Pod logs: long lines are corrupted when using timestamps=true",
fixed in kubernetes/kubernetes#113481.
Fix backported into Kubernetes release branches in the following PRs:
release-1.23 (1.23.14)
release-1.24 (1.24.8)
release-1.25 (1.25.4)
Fixes ported to OpenShift in the following PRs:
release-4.9 (4.9.x) (not yet merged)
release-4.10 (4.10.42)
release-4.11 (4.11.16)
release-4.12 (4.12.0)
The fix in this PR should detect the Kubernetes version and use `--timestamps` accordingly. However, due to the "wild wild west" nature of the Kubernetes world, we added the `RECEPTOR_KUBE_SUPPORT_RECONNECT` environment variable to force enabling/disabling the fix. `RECEPTOR_KUBE_SUPPORT_RECONNECT` has the options described in the notes below, and the flag can be set via the AWX custom resource.
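As a hedged sketch of how such a flag might gate the behavior (the "enabled"/"disabled" values come from the notes below; treating any other value as "auto-detect via the API server version" is an assumption here, not necessarily what the PR implements):

```go
package main

import (
	"fmt"
	"os"
)

// shouldUseTimestamps decides whether the timestamp-based reconnect logic is used.
// kubeHasLogFix is a placeholder for a version-detection helper.
func shouldUseTimestamps(kubeHasLogFix func() bool) bool {
	switch os.Getenv("RECEPTOR_KUBE_SUPPORT_RECONNECT") {
	case "enabled":
		return true // force on, skipping the version check entirely
	case "disabled":
		return false // force off, preserving the old behavior
	default:
		return kubeHasLogFix() // assumed fallback: detect via the API server version
	}
}

func main() {
	use := shouldUseTimestamps(func() bool { return true })
	fmt.Println("use timestamps:", use)
}
```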
NOTE: `RECEPTOR_KUBE_SUPPORT_RECONNECT` will bypass the version check for ALL container groups when set to "enabled" or "disabled". If a specific container group does not have the right Kubernetes version and `RECEPTOR_KUBE_SUPPORT_RECONNECT` is set to "enabled", job execution with that container group will fail due to a corrupted log stream.
NOTE: It is also possible for the kubelet version to differ from the kube-apiserver version. Detecting the presence of the fix using the kube-apiserver version is not the safest option, but it's the only option we have here.