Job end in Failed without error when it had large logs #1675

Closed
3 tasks done
matiasperkins opened this issue Dec 27, 2023 · 4 comments

@matiasperkins

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that the AWX Operator is open source software provided for free and that I might not receive a timely response.

Bug Summary

Hi!

We are having an issue where jobs with large logs suddenly end as Failed without any error.
I checked the pod logs on K8s, but they look normal until the connection is lost when the job ends. I also looked for a setting that limits the size of the logs or something similar, but I could not find one.

For one of our projects, we worked around it by running the long playbook inside a small one. Maybe it is a small configuration problem or something I'm not seeing, but I haven't found a way to make it work.

AWX Operator version

1.0.0

AWX version

21.8.0

Kubernetes platform

kubernetes

Kubernetes/Platform version

Tanzu Cluster v1.23.8+vmware.3

Modifications

no

Steps to reproduce

Run any playbook that produces a very long log, around 15k lines.
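
For anyone trying to reproduce this, a minimal sketch of such a playbook might look like the following. It is not the playbook from the affected project: the play name, message text, and loop count are made up for illustration, and it assumes the job template's inventory can resolve localhost. Whether this volume of output is enough to trigger the failure on a given cluster is not confirmed.

```yaml
# Hypothetical reproduction playbook, not taken from the affected project.
# Each loop iteration prints a few lines, so 5000 iterations give
# output on the order of 15k lines.
- name: Generate a large job log
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Print many filler lines
      ansible.builtin.debug:
        msg: "filler line {{ item }}"
      loop: "{{ range(0, 5000) | list }}"
```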

Expected results

The job finishes regardless of how large the log is.

Actual results

The job ends as Failed without any error and with incomplete log output.

Additional information

No response

Operator Logs

No response

@mcen1

mcen1 commented Dec 27, 2023

This sounds similar to an issue I had in AWX where the job output would be truncated and the logs would end with an "error" even though there was no ending output and it seemed like the job did everything it needed to do successfully.

We solved it in my case by increasing the container log size parameter. We are on RKE, and this was a setting we had to add to a YAML file on the OS. I think this link describes it: ansible/awx#10366 (comment). We just set container_log_max_size_mb to about 500 MB. k8s garbage-collects the containers afterward, so you don't need to be too concerned about those big logs staying around for a while, but of course keep an eye out to be safe.
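
For clusters where the kubelet configuration file can be edited directly, the upstream equivalent of that setting can be sketched as below. This is not the RKE file described above: it is a generic KubeletConfiguration fragment, the 500Mi value just mirrors the figure mentioned here, and where the file lives (and whether it can be changed at all) depends on the distribution.

```yaml
# Sketch of a kubelet configuration fragment (assumed generic setup, not RKE-specific).
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Maximum size of a container log file before it is rotated.
containerLogMaxSize: 500Mi
# Maximum number of log files kept per container.
containerLogMaxFiles: 5
```

The kubelet has to be restarted for a change like this to take effect.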

I'm not as familiar with Tanzu, so I don't know how you'd configure an equivalent there. My understanding, though, is that a code fix pushed a while ago to this component of the AWX ecosystem (ansible/receptor#683) addresses the issue on the AWX side without changing the Kubernetes container log size parameter. I believe you need to set RECEPTOR_KUBE_SUPPORT_RECONNECT as an environment variable in the AWX operator CRD, as noted in #1203 (comment), provided you're on versions of k8s and awx-operator that honor this parameter and the accompanying code changes. I haven't been able to test this myself, but the folks working on this project are way smarter than I am, so it probably works.
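
As a rough sketch of what that could look like in the AWX custom resource, assuming the operator's ee_extra_env field is the right place for it and that "enabled" is an accepted value (both worth checking against the docs for your operator version); the resource name is a placeholder:

```yaml
# Sketch only: resource name is hypothetical; verify ee_extra_env and the
# accepted values against your awx-operator and receptor versions.
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
spec:
  ee_extra_env: |
    - name: RECEPTOR_KUBE_SUPPORT_RECONNECT
      value: enabled
```

If the variable seems to have no effect, the AWX/receptor and kubelet versions in use are the first thing to check, since the reconnect behavior depends on fixes on both sides.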

Hope this is helpful, sorry if I misstated anything.

@matiasperkins
Author

matiasperkins commented Dec 29, 2023

Hi @mcen1 ! Thank you very much for the info. My error is like the one you describe.

I tried setting RECEPTOR_KUBE_SUPPORT_RECONNECT, but it didn't work. And bad news for me: the Tanzu product I have doesn't allow configuring containerLogMaxSize.

Do you think configuring a log aggregator would fix the issue?

Edit:
I tried it on a k3s cluster, and changing containerLogMaxSize with this config works without any problem:

ExecStart=/usr/local/bin/k3s \
    server \
        '--write-kubeconfig-mode' \
        '644' \
        '--kubelet-arg' \
        'container-log-max-files=4' \
        '--kubelet-arg' \
        'container-log-max-size=50Mi' \

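The same kubelet arguments can presumably also be given through the k3s configuration file instead of the service unit. A sketch, assuming the default config path /etc/rancher/k3s/config.yaml and reusing the values above:

```yaml
# Sketch of /etc/rancher/k3s/config.yaml (assumed default path);
# mirrors the flags from the ExecStart line above.
write-kubeconfig-mode: "644"
kubelet-arg:
  - "container-log-max-files=4"
  - "container-log-max-size=50Mi"
```
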
Now I need to continue on Tanzu, since the idea is to work with that tool.

@mcen1

mcen1 commented Dec 29, 2023

I can't say for sure whether a log aggregator would be sent all the logs; I'm not really clear on the internals of how it all works. I can say that setting up a log aggregator won't solve the issue of being unable to see the full job output inside AWX itself.

I have no idea how Tanzu works, but a google search brings up this: https://kb.vmware.com/s/article/87107 and it resembles what you might have to do. Maybe someone with more experience than me can say if anything more needs to be done with that RECEPTOR_KUBE_SUPPORT_RECONNECT environment variable, because it was my understanding that's supposed to be the "right" way to fix this issue.

@matiasperkins
Author

> I can't say for sure whether a log aggregator would be sent all the logs; I'm not really clear on the internals of how it all works. I can say that setting up a log aggregator won't solve the issue of being unable to see the full job output inside AWX itself.
>
> I have no idea how Tanzu works, but a google search brings up this: https://kb.vmware.com/s/article/87107 and it resembles what you might have to do. Maybe someone with more experience than me can say if anything more needs to be done with that RECEPTOR_KUBE_SUPPORT_RECONNECT environment variable, because it was my understanding that's supposed to be the "right" way to fix this issue.

Thank you very much for the help @mcen1 ! I will talk with the owner of Tanzu and see if we can fix it.
