
Add retry mechanism to handle intermittent connection issues with Kubernetes logging stream #683

Merged

Conversation

TheRealHaoLiu
Member

@TheRealHaoLiu TheRealHaoLiu commented Oct 25, 2022

Due to an issue in Kubernetes, AWX currently can't run jobs longer than 4 hours when deployed on Kubernetes. More context on that is in ansible/awx#11805.

This PR adds logic that picks back up from the last line we saw, using the Kubernetes log timestamps.
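For illustration, here is a minimal sketch (not the PR's actual implementation) of what resuming from the last seen timestamp can look like with client-go; the function signature and the emit callback are assumptions:

// Sketch only: stream pod logs with timestamps enabled, remember the last
// timestamp we saw, and on reconnect ask the API server for logs since then,
// skipping anything we already emitted.
package main

import (
	"bufio"
	"context"
	"strings"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func streamLogs(ctx context.Context, cs kubernetes.Interface, ns, pod string, emit func(string)) error {
	var lastSeen *metav1.Time
	for {
		opts := &corev1.PodLogOptions{
			Follow:     true,
			Timestamps: true,     // each line is prefixed with an RFC3339Nano timestamp
			SinceTime:  lastSeen, // nil on the first attempt
		}
		stream, err := cs.CoreV1().Pods(ns).GetLogs(pod, opts).Stream(ctx)
		if err != nil {
			return err
		}
		scanner := bufio.NewScanner(stream)
		for scanner.Scan() {
			ts, msg, ok := strings.Cut(scanner.Text(), " ")
			if !ok {
				continue
			}
			if t, err := time.Parse(time.RFC3339Nano, ts); err == nil {
				if lastSeen != nil && !t.After(lastSeen.Time) {
					continue // already emitted before the stream dropped
				}
				mt := metav1.NewTime(t)
				lastSeen = &mt
			}
			emit(msg)
		}
		stream.Close()
		if ctx.Err() != nil {
			return ctx.Err()
		}
		// Stream dropped (e.g. closed by the kubelet after ~4h); loop and resume from
		// lastSeen. A real implementation would also stop once the pod has finished.
	}
}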

This requires a fix in Kubernetes for the issue "Pod logs: long lines are corrupted when using timestamps=true", fixed in kubernetes/kubernetes#113481.

The fix was backported into Kubernetes release branches in the following PRs:
release-1.23 (1.23.14)
release-1.24 (1.24.8)
release-1.25 (1.25.4)

The fix was ported to OpenShift in the following PRs:
release-4.9 (4.9.x) (not yet merged)
release-4.10 (4.10.42)
release-4.11 (4.11.16)
release-4.12 (4.12.0)

The fix in this PR should detect the Kubernetes version and use --timestamps accordingly. However, due to the "wild wild west" nature of the Kubernetes world, we added the RECEPTOR_KUBE_SUPPORT_RECONNECT environment variable to force-enable or force-disable the fix.

RECEPTOR_KUBE_SUPPORT_RECONNECT has the following options:

  • “enabled”: use timestamps with the log and enable the new code path
  • “disabled”: do not use timestamps and use the original code path
  • “auto”: auto-detect whether it is appropriate to enable timestamps based on the Kubernetes version

This flag can be set via the AWX custom resource:

apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
  namespace: awx
spec:
  …
  ee_extra_env: |
    - name: RECEPTOR_KUBE_SUPPORT_RECONNECT
      value: enabled

NOTE: RECEPTOR_KUBE_SUPPORT_RECONNECT bypasses the version check for ALL container groups when set to "enabled" or "disabled". If a specific container group does not have the right Kubernetes version and RECEPTOR_KUBE_SUPPORT_RECONNECT is set to "enabled", job execution with that container group will fail due to a corrupted log stream.

NOTE: it is also possible for the kubelet version to differ from the kube-apiserver version, so detecting the presence of the fix via the kube-apiserver version is not the safest option, but it's the only option we have here.
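For illustration, a rough sketch of how the three option values could be interpreted; kubeVersionSupportsTimestamps is a hypothetical helper (not part of this PR) that would compare the detected kube-apiserver version against the patched releases listed above, and treating an unset variable as "auto" is also an assumption:

// Sketch only: decide whether to enable the timestamp-based reconnect code path.
package main

import (
	"fmt"
	"os"
	"strings"
)

// Hypothetical helper: would check the detected kube-apiserver version against
// 1.23.14 / 1.24.8 / 1.25.4 and newer.
func kubeVersionSupportsTimestamps() (bool, error) { return false, nil }

func shouldEnableTimestamps() (bool, error) {
	switch strings.ToLower(os.Getenv("RECEPTOR_KUBE_SUPPORT_RECONNECT")) {
	case "enabled":
		return true, nil // force the new code path, skipping the version check
	case "disabled":
		return false, nil // force the original code path
	case "auto", "": // assuming unset falls back to auto-detection
		return kubeVersionSupportsTimestamps()
	default:
		return false, fmt.Errorf("invalid value for RECEPTOR_KUBE_SUPPORT_RECONNECT")
	}
}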

@egmar

egmar commented Oct 26, 2022

Hi,
Is this PR also covering the issue described in ansible/awx#11805?

@shanemcd
Member

@egmar Yes, we're trying to figure that out. Hopefully along with ansible/awx#11338

@Vitexus

Vitexus commented Oct 27, 2022

I can confirm that the testing job kept running the whole night.

---
- hosts: "{{ target|default('localhost') }}"
  become: false
  gather_facts: false
  tasks:
   - name: Run Job
     ansible.builtin.shell: 'while true ; do date ; sleep 3600; done'
     register: long_output_job
     ignore_errors: true

Good job!
Many thanks to @TheRealHaoLiu, and looking forward to the 1.2.4 release.

@shanemcd
Member

@Vitexus You are awesome, thanks for testing. We iterated a bit more. Can you try with the latest code here?

@Vitexus

Vitexus commented Oct 27, 2022

Sure.


@shanemcd
Member

I found another bug where it breaks when very long lines are emitted. Will try to figure out a solution later today.

@shanemcd
Member

Well... it's not a bug in this code 😰

I have a pod that starts and runs this:

for i in {1..30}; do echo $i; printf "%0.s-" {1..100000}; echo; done

When running $ kubectl logs echo-tcfjg --timestamps, I can see random timestamps in the middle of the output:

[screenshot of the kubectl logs output]

For whatever reason, there are only 16385 bytes printed, followed by a timestamp, then more bytes.

Since AWX expects each line to be either intact JSON or a full base64-encoded blob of zip data, this will definitely break certain workloads ☹️

@shanemcd
Member

Maybe related: kubernetes/kubernetes#77822

@TheRealHaoLiu
Member Author

@Vitexus btw love your github bio
are u on our matrix server by any chance? https://matrix.to/#/#awx:ansible.com (there may be a rare IRL angry spud plushy in it for you 😉 )

@shanemcd
Member

The problem seems to be even worse in OpenShift. There's only 4097 bytes between the timestamps...

@Vitexus

Vitexus commented Oct 31, 2022


Sad 👎

@Vitexus

Vitexus commented Oct 31, 2022

Ok, next try with latest changes tomorrow ...

@TheRealHaoLiu
Member Author

The latest idea is to just use local time to resume the log. This implementation has the problem that we may get repeated lines in the log.

If the last few lines of job output get repeated, this causes a problem with the ansible-runner process and could cause ansible-runner to hang forever.

Gonna pivot to #685 to leverage socat and port forwarding to stream the log back via TCP.

@gundalow

gundalow commented Nov 1, 2022

Hao & Seth, can you please update this and get it tested against kubernetes/kubernetes#113481?

Thanks

@gundalow gundalow reopened this Nov 1, 2022
@TheRealHaoLiu
Member Author

TheRealHaoLiu commented Nov 1, 2022

Good news: we were able to convince the Kubernetes maintainers that the log issue @shanemcd discovered is indeed a bug 🥳

Details of the fix are in kubernetes/kubernetes#113481.

With that, we can proceed with fleshing this PR out.

Now we should be able to use the exact log timestamps to prevent duplication of log messages and resume from the specific message at the time of disconnection.

@TheRealHaoLiu
Member Author

@Vitexus thanks for caring about this topic and helping us out with this issue, we really appreciate it. Have you heard about our community Matrix chatroom https://matrix.to/#/#awx:ansible.com? It's a great place for collaborating with the AWX community, join us!

@TheRealHaoLiu TheRealHaoLiu changed the title add retry logic to receptor logstream on kube [WIP] add retry logic to receptor logstream on kube Nov 3, 2022
if err == nil {
	break
} else {
	time.Sleep(100 * time.Millisecond)
Member


On my machine, this wasn't quite long enough to recover gracefully when the kubelet is restarted. Changing this to 1 second made it work. I see where we changed 4 other places from 5 seconds to 100 milliseconds. Let's go with somewhere in the middle... maybe 1 or 2 seconds.
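For reference, a fragment sketching the kind of retry loop being discussed here, with a ~1 second sleep; the variable names, retry count, and use of the standard-library logger are assumptions, not the PR's actual code:

// Sketch only: retry opening the log stream, sleeping ~1s between attempts so a
// restarted kubelet has time to come back before we give up.
var stream io.ReadCloser
var err error
for attempt := 0; attempt < 5; attempt++ { // retry count chosen arbitrarily for the sketch
	stream, err = logRequest.Stream(ctx) // logRequest is an assumed *rest.Request from GetLogs
	if err == nil {
		break
	}
	log.Printf("could not open log stream, retrying: %v", err)
	time.Sleep(1 * time.Second) // roughly the middle ground suggested above
}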

Member


logger.warn here too.

@TheRealHaoLiu
Copy link
Member Author

Got a test image up: quay.io/haoliu/awx-ee:kubernetes-logstream-retry (will continue to update it to track the latest changes).

@TheRealHaoLiu
Member Author

TheRealHaoLiu commented Nov 7, 2022

kubernetes/kubernetes#113481 merged. In our logic we do need to deal with and preserve the old behavior (where jobs fail after 4 hours) if k8s does not contain the logging fix, so if anyone has a suggestion for how we can deal with that gracefully, please comment.

We might need an external param to detect whether the deployment is on a kube with the patch and enable/disable timestamps accordingly.

@egmar

egmar commented Nov 7, 2022


I'd use https://pkg.go.dev/k8s.io/client-go/discovery#DiscoveryClient.ServerVersion and fall back to the old behaviour on a K8s cluster without the fix.
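A minimal sketch of that suggestion, assuming the apimachinery version-parsing utilities and treating 1.26+ as containing the fix (an assumption, since the fix merged to master before the 1.26 release):

// Sketch only: ask the kube-apiserver for its version and report whether it should
// contain the log-timestamp fix, per the backport list in the PR description.
package main

import (
	"k8s.io/apimachinery/pkg/util/version"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/rest"
)

func kubeHasLogFix(config *rest.Config) (bool, error) {
	dc, err := discovery.NewDiscoveryClientForConfig(config)
	if err != nil {
		return false, err
	}
	info, err := dc.ServerVersion()
	if err != nil {
		return false, err
	}
	v, err := version.ParseGeneric(info.GitVersion) // e.g. "v1.24.8"
	if err != nil {
		return false, err
	}
	// First patched release per minor version; 1.26.0 is assumed to ship with the fix.
	for _, fixed := range []string{"1.23.14", "1.24.8", "1.25.4"} {
		f := version.MustParseGeneric(fixed)
		if v.Major() == f.Major() && v.Minor() == f.Minor() {
			return v.AtLeast(f), nil
		}
	}
	return v.AtLeast(version.MustParseGeneric("1.26.0")), nil
}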

@oliverf1

Hello,
I can also confirm that this fixed the issue for me :-) Thank you all for the fix!
However, I had to build a custom image for the execution environment, because the Containerfile of quay.io/ansible/awx-ee:latest references quay.io/ansible/receptor:devel, but this patch is only available in the latest image.
So is there any chance of the devel image being updated to this 1.3.0 version? It's currently using 1.3.0.dev2, which is older. Or should I create an MR in the ansible/awx-ee repo to update the Containerfile to target quay.io/ansible/receptor:latest?

@kingtutt1906


Could you show your custom execution environment build? I tried using the quay.io/ansible/awx-ee:latest execution environment and it still timed out after 4hrs.

@oliverf1

oliverf1 commented Feb 20, 2023


I never experienced the 4h timeout (because I don't have such long jobs), but I was experiencing the issue with log rotation a lot.
The patch is now available in the quay.io/ansible/awx-ee:latest default image; no need to rebuild an image:
$ docker run -it quay.io/ansible/awx-ee:latest receptor --version
1.3.1+g3c0278f

The patch was added in version 1.3.0

domq pushed a commit to epfl-si/wp-ops that referenced this pull request Sep 27, 2024
As seen in ansible/awx#11338 and ansible/receptor#446

- Force `RECEPTOR_KUBE_SUPPORT_RECONNECT` as per ansible/receptor#683
- Pump up timeouts thereof