AWX jobs can't tolerate the EKS scale out #14293
Labels: community, component:api, component:awx_collection, type:bug
Bug Summary
We have an environment for performance tests where we try to run more than 400 heavy workflows simultaneously. This causes the EKS control plane (master nodes) to scale out, and about half of our pods fail with errors. We had already opened an issue about this when we reproduced the K8s master node termination in a kOps cluster (since we cannot do that with EKS), and it was fixed for kOps clusters here, thanks to the community and especially to @TheRealHaoLiu. We also validated the fix with AWS support: we asked them to terminate K8s nodes manually and confirmed that the issue was resolved. The problem is that the fix does not fully cover the scale-in or scale-out process of the EKS control plane.
In our case, jobs fail on EKS when the control plane scales out. For now, we are running the control-plane-ee with the following configuration:
After the bugfix was merged, some pods can tolerate the scale-out and finish running successfully, which can be considered an intermediate result. I see the following logs related to them (the second error happens occasionally and does not affect AWX functionality):
automation-job-141264-fxwvc (finished successfully)
But there are still many pods that fail with errors. Here are the variations of logs that we see (we show all control-plane-ee logs related to the particular job):
automation-job-143331-7lggn (failed)
ERROR 2023/07/27 20:19:56 [JwdC5LdB] Error reading from pod awx-workers/automation-job-143331-7lggn: unexpected EOF
automation-job-140003-gq82x (failed)
automation-job-142724-mhdcm (failed)
ERROR 2023/07/27 20:17:04 [7Ewis9Pm] Error reading from pod awx-workers/automation-job-142724-mhdcm: read tcp 100.64.85.69:40134->172.27.2.235:443: read: connection reset by peer
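For context, both errors appear to come from the long-lived log stream that the control plane keeps open to the Kubernetes API server for each job pod. The following is only an illustrative sketch (not the actual Receptor code) of such a follow-style log read using the official kubernetes Python client; the pod and namespace names are taken from the example above, and a read like this fails in a similar way when the connection to the API server is reset mid-stream:

from kubernetes import client, config

# Illustrative only: a "kubectl logs -f"-style read over the HTTPS connection to the
# API server, the kind of long-lived connection that breaks during a control plane scale out.
config.load_kube_config()          # or config.load_incluster_config() when running inside a pod
core_v1 = client.CoreV1Api()

resp = core_v1.read_namespaced_pod_log(
    name="automation-job-143331-7lggn",   # pod name from the logs above
    namespace="awx-workers",
    follow=True,
    _preload_content=False,               # return the raw urllib3 response so we can stream it
)

try:
    for chunk in resp.stream():            # blocks on the connection to the API server (:443)
        print(chunk.decode(errors="replace"), end="")
except Exception as exc:                   # e.g. "connection reset by peer" / unexpected EOF
    print(f"log stream interrupted: {exc}")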
We still have our performance environment up and running and will gladly provide more logs if needed. Please tell us if changing the log level from DEBUG to TRACE makes sense and could give us more information about the issue.
AWX version
22.5.0
Select the relevant components
Installation method
kubernetes
Modifications
no
Ansible version
No response
Operating system
No response
Web browser
No response
Steps to reproduce
Run more than 400 heavy workflows in an EKS cluster so that the EKS control plane scales out.
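For reference, load of this kind can be generated with a minimal script against the standard AWX REST API workflow launch endpoint. The sketch below is only an illustration; the host, token, workflow job template ID, and concurrency are placeholders, not our actual environment values:

import concurrent.futures
import requests

AWX_HOST = "https://awx.example.com"      # placeholder
AWX_TOKEN = "REPLACE_WITH_OAUTH2_TOKEN"   # placeholder
WORKFLOW_TEMPLATE_ID = 42                 # placeholder workflow job template ID
LAUNCH_COUNT = 400

def launch_workflow(_):
    # POST to /api/v2/workflow_job_templates/<id>/launch/ to start one workflow job
    resp = requests.post(
        f"{AWX_HOST}/api/v2/workflow_job_templates/{WORKFLOW_TEMPLATE_ID}/launch/",
        headers={"Authorization": f"Bearer {AWX_TOKEN}"},
        json={},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["id"]

with concurrent.futures.ThreadPoolExecutor(max_workers=50) as pool:
    workflow_job_ids = list(pool.map(launch_workflow, range(LAUNCH_COUNT)))

print(f"launched {len(workflow_job_ids)} workflow jobs")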
Expected results
Jobs can tolerate the EKS control plane scale out
Actual results
About half of the jobs fail with errors
Additional information
The EKS cluster version is 1.24