AWX 19.2.0 - Job Output not in sync - /var/run/receptor/receptor.sock Deadlock? #10366
Comments
Hello. A couple of questions: Is there any traceback in either the job output or the awx-task container? Did you specify any custom EEs when installing, or are you using the default?
@shanemcd Thanks for the reply. See updates below:
No traceback in either the job output or awx-task.
Yes, I did specify a custom EE. It's using the base awx-ee:0.3.0 image + ansible 2.11.1 installed + some network collections. EDIT: After further investigation, I managed to get the output synced with the AWX UI by setting
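For context, a custom EE like the one described above is usually defined with ansible-builder. The sketch below is hypothetical: only the awx-ee:0.3.0 base image comes from the comment; the quay.io registry path, the requirements files, the ansible pin, and the collection names are illustrative assumptions.

```yaml
# Hypothetical ansible-builder (1.x) definition approximating the EE described above.
version: 1
build_arg_defaults:
  EE_BASE_IMAGE: 'quay.io/ansible/awx-ee:0.3.0'   # registry path assumed
dependencies:
  python: requirements.txt   # e.g. a pin along the lines of ansible-core==2.11.1
  galaxy: requirements.yml   # e.g. the network collections mentioned in the comment
```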
Thank you for the information. I'm wondering if you could share some more details about your workload:
And if possible, sharing the logs from the worker pod may help us debug. If they contain sensitive information, you can send me an encrypted email instead of posting them here. My email is on my profile, and my gpg public key is at https://github.com/shanemcd.gpg. To obtain the logs, you can prevent the worker pods from being deleted by doing something like this:

```yaml
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-awx-config
data:
  custom.py: |
    RECEPTOR_RELEASE_WORK = False
---
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
spec:
  task_extra_volume_mounts: |
    - name: custom-py
      mountPath: /etc/tower/conf.d/custom.py
      subPath: custom.py
  extra_volumes: |
    - name: custom-py
      configMap:
        defaultMode: 420
        items:
          - key: custom.py
            path: custom.py
        name: custom-awx-config
```

The pod will be left behind after the job is complete so you can grab the logs.
Once you are done, you can undo the custom config and AWX will clean up all of the pods it created. Thank you in advance for any information you can provide.
Please find details below:

- 35 (Cisco network devices)
- "count": 252
- v1.17.7 on CentOS 7.9

Job output container logs sent to your email. Thanks
We also experience really strange issues; they result in "Task was marked as running but was not present in the job queue, so it has been marked as failed." It seems receptor gets killed or stops somehow.
Setting more inotify watches on the nodes helped for me. I found this because I was getting an error when I was trying to check the logs with --follow.
Unfortunately no difference for us :|
Hello there, we have the same problem as @lucabrasi83.
What I have found out is that it is probably somehow related to the amount of logs/output.
Is this issue related to #9594? It's really severe on our side; the solution is almost unusable.
I have the same issue, and like @andrejzelnik I arrived at the same conclusion that it's due to the amount of logs.
Also, to prove that it was really not related to AWX, I saved the full logs of a problematic job template in JSON format and with that I created a very basic container that reads this file and echoes its content to stdout at the same pace as the real job template (using the timestamps in the log). Once the pod running that container starts, I attach to it. I have tested this on several k3s clusters that I own and I have the exact same issue on all of them. But if I test it on an AWS EKS cluster, then it works. HTH
Thanks @oliverf1 for the findings. Looking at the K8s docs regarding logging, it seems there is a default log size limit dictated by the kubelet / CRI: https://kubernetes.io/docs/concepts/cluster-administration/logging/
For example, that can be set in Docker as per the following doc: https://docs.docker.com/config/containers/logging/local/
@lucabrasi83 I will check that in more detail tomorrow, but I'm not sure it will help. For example, if I "cat" the whole log file in one shot, then it works, no hang. Also, if I print the log file at a slower pace, like one line per second, then it also works. The only way I found to reproduce the issue is to echo the log at the same pace as the real job template does.
@oliverf1 Hi, interesting that you saw the log stop in the running kubectl as well; we observed that too. Digging deeper, we ran tcpdump on the awx-ee node, recording the traffic.
For the traffic from our kubectl (v1.20) there are L5 packets even if no data is visibly sent/received by kubectl (probably some kind of application-layer heartbeat). Keepalives are also sent and received every 30s. For the client-go package (v0.18.6 per https://github.com/ansible/receptor/blob/1.0.0a2/go.mod) there are keepalives but no L5 traffic, and after exactly 3600s the kube apiserver sends a TCP FIN packet signaling the end of the connection. Because this is a managed service, we have limited access to the k8s control plane to debug (Azure).
For important jobs/tasks we have introduced the async/poll feature to create some application-layer (L5) noise; this works as a workaround for us (a sketch of such a task is below).
Environment: Azure Kubernetes Service (v1.20) / Operator 0.12 / AWX 19.2.2 / some customization (mounts and custom user EE)
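A minimal sketch of that async/poll workaround, assuming a generic long-running command; the module choice, the script path, and the timing values are illustrative, not taken from the thread:

```yaml
# Hypothetical playbook task: run a long command asynchronously and poll it,
# so the play produces periodic output instead of staying silent while it waits.
- name: Run long job while keeping the output stream active
  ansible.builtin.command: /usr/local/bin/long_running_job.sh   # illustrative path
  async: 7200   # allow up to 2 hours for the command to finish
  poll: 30      # check back (and emit task status) every 30 seconds
```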
@oliverf1, I came to the same conclusion as you: the job fails when the logs get rotated by k3s. So it's indeed not an AWX problem, but a k3s one. I reinstalled k3s with bigger log limits:

```shell
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--kubelet-arg container-log-max-files=4 --kubelet-arg container-log-max-size=50Mi" sh -
```

My logs didn't get rotated anymore, and my k3s instance is now configured to keep 4 log files of 50MB, allowing 200MB of total log space to be retained per container. Sidenote: I haven't had any log bigger than 50MB yet, so I don't know yet what happens when that 2nd, 3rd, etc. log file gets created.
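As an alternative to passing the flags at install time, the same kubelet arguments can also live in the k3s configuration file. A minimal sketch, assuming the standard /etc/rancher/k3s/config.yaml location and a k3s version recent enough to support the config file (restart the k3s service after editing):

```yaml
# /etc/rancher/k3s/config.yaml - repeatable CLI flags become YAML lists.
# Values mirror the install command above.
kubelet-arg:
  - "container-log-max-files=4"
  - "container-log-max-size=50Mi"
```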
@nicovs That is a very good catch! I can confirm that kubectl logs -f does indeed stop when the log file at the node level is rotated.
Just wanted to confirm that the same options @nicovs used also work for us on Azure Kubernetes Service (AKS) to fix truncated log output.
You may need to adapt this config. We used 500MB as we have a lot of machines and tasks; I think the old log size was 10 or 40MB. If you want to check your kubelet config (before/after) you can use the
Ref: https://docs.microsoft.com/en-us/azure/aks/ssh#create-the-ssh-connection-to-a-linux-node Sorry if this was too off-topic/Azure-specific. Maybe this helps @andrejzelnik
I can also confirm that increasing the file size limit for container logs solves this problem. With an on-premise k8s cluster installed with kubeadm you can do it by adding containerLogMaxSize: 500Mi to /var/lib/kubelet/config.yaml and then restarting the kubelet service (see the sketch below). You may also want to decrease the number of retained files by adjusting the containerLogMaxFiles option so as not to run out of space, but keep in mind that it cannot be less than 2.
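A minimal sketch of the kubelet settings described above, assuming a kubeadm-provisioned node where the kubelet reads /var/lib/kubelet/config.yaml; only the 500Mi size comes from the comment, the file count of 5 is an illustrative value:

```yaml
# /var/lib/kubelet/config.yaml (excerpt) - KubeletConfiguration fields.
# Apply by restarting the kubelet, e.g. systemctl restart kubelet.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: 500Mi   # max size of a container log file before rotation
containerLogMaxFiles: 5      # rotated files kept per container; must be >= 2
```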
Thank you @Kardi5, I have tried out your proposal and it works now.
Hello, here we can't modify the kubelet config on our GKE cluster. We use the official Google image and we don't want to manage our own image. This is a real problem for us: we have users with jobs in "ERROR" status while we have the full logs in our external tool.
Increasing the log size works, but it's not much of a solution when using a managed k8s service where you cannot access this setting :-(
Even if you can change the log size, increasing it over and over is not a solution. In my case I have a very large job running on many nodes that generates about 600MB of logs, so to run it successfully my log size limit would need to be about 800MB. That means each and every container running in this cluster could write up to 800MB of logs. This is not sustainable. So my solution was to run this job ... outside of AWX.
Closing this in favor of #11338
ISSUE TYPE
SUMMARY
Hi,
For some jobs, we're experiencing out-of-sync AWX job output, such as below, where the job is marked as failed with no summary:
The issue looks similar to #9967, although I don't get any errors in the logs of the awx-web and awx-task containers.
From the logs of the actual pod running the job, I can see that it completed successfully.
However, the awx-ee container logs show what looks like a deadlocked connection error on the receptor socket file:
From observation, it seems to happen on fast-running jobs. One workaround I found is to enable -vvv debugging, which tends to slow down the job execution, and then the job output syncs properly with AWX.
ENVIRONMENT
STEPS TO REPRODUCE
Launch a job template
EXPECTED RESULTS
Expected to see the actual summary of the job results.
ACTUAL RESULTS
ADDITIONAL INFORMATION