AWX jobs can't tolerate the EKS scale out #14293
Labels: community, component:api, component:awx_collection, type:bug
Bug Summary
We have an environment for performance tests where we try to run more than 400 heavy workflows simultaneously. This causes the EKS control plane (master nodes) to scale out, and about half of our pods fail with errors. We had already opened an issue about this when we reproduced the K8s master node termination in a kOps cluster (since we cannot do that with EKS), and it was fixed for kOps clusters here, thanks to the community and especially to @TheRealHaoLiu. We also validated the fix with AWS support: we asked them to terminate K8s nodes manually and confirmed that the issue was resolved. The problem is that the fix does not fully cover the scale-in or scale-out process of the EKS control plane.
In our case, jobs fail on EKS when the control plane scales out. For now, we are running the control-plane-ee with the following configuration:
After the bugfix was merged, some pods can tolerate the scale-out and finish running successfully, which can be considered an intermediate result. I see the following logs related to them (the second error happens occasionally and does not affect AWX functionality):
automation-job-141264-fxwvc (finished successfully)
But there are still many pods that fail with errors. Here are the variations of logs that we see (we show all control-plane-ee logs related to the particular job):
automation-job-143331-7lggn (failed)
ERROR 2023/07/27 20:19:56 [JwdC5LdB] Error reading from pod awx-workers/automation-job-143331-7lggn: unexpected EOF
automation-job-140003-gq82x (failed)
automation-job-142724-mhdcm (failed)
ERROR 2023/07/27 20:17:04 [7Ewis9Pm] Error reading from pod awx-workers/automation-job-142724-mhdcm: read tcp 100.64.85.69:40134->172.27.2.235:443: read: connection reset by peer
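For context, both errors appear to come from the long-lived log stream that the control plane keeps open to the Kubernetes API server for each job pod. The following is only an illustrative sketch (not the actual Receptor code) of such a follow-style log read using the official kubernetes Python client; the pod and namespace names are taken from the example above, and a read like this fails in a similar way when the connection to the API server is reset mid-stream:

from kubernetes import client, config

# Illustrative only: a "kubectl logs -f"-style read over the HTTPS connection to the
# API server, the kind of long-lived connection that breaks during a control plane scale out.
config.load_kube_config()          # or config.load_incluster_config() when running inside a pod
core_v1 = client.CoreV1Api()

resp = core_v1.read_namespaced_pod_log(
    name="automation-job-143331-7lggn",   # pod name from the logs above
    namespace="awx-workers",
    follow=True,
    _preload_content=False,               # return the raw urllib3 response so we can stream it
)

try:
    for chunk in resp.stream():            # blocks on the connection to the API server (:443)
        print(chunk.decode(errors="replace"), end="")
except Exception as exc:                   # e.g. "connection reset by peer" / unexpected EOF
    print(f"log stream interrupted: {exc}")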
We still have our performance environment up and running and will gladly provide more logs if needed. Please tell us if changing the log level from DEBUG to TRACE makes sense and could give us more information about the issue.
AWX version
22.5.0
Select the relevant components
Installation method
kubernetes
Modifications
no
Ansible version
No response
Operating system
No response
Web browser
No response
Steps to reproduce
Run more than 400 heavy workflows in an EKS cluster so that the EKS control plane scales out.
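For reference, load of this kind can be generated with a minimal script against the standard AWX REST API workflow launch endpoint. The sketch below is only an illustration; the host, token, workflow job template ID, and concurrency are placeholders, not our actual environment values:

import concurrent.futures
import requests

AWX_HOST = "https://awx.example.com"      # placeholder
AWX_TOKEN = "REPLACE_WITH_OAUTH2_TOKEN"   # placeholder
WORKFLOW_TEMPLATE_ID = 42                 # placeholder workflow job template ID
LAUNCH_COUNT = 400

def launch_workflow(_):
    # POST to /api/v2/workflow_job_templates/<id>/launch/ to start one workflow job
    resp = requests.post(
        f"{AWX_HOST}/api/v2/workflow_job_templates/{WORKFLOW_TEMPLATE_ID}/launch/",
        headers={"Authorization": f"Bearer {AWX_TOKEN}"},
        json={},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["id"]

with concurrent.futures.ThreadPoolExecutor(max_workers=50) as pool:
    workflow_job_ids = list(pool.map(launch_workflow, range(LAUNCH_COUNT)))

print(f"launched {len(workflow_job_ids)} workflow jobs")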
Expected results
Jobs can tolerate the EKS control plane scale out
Actual results
About half of the jobs fail with errors
Additional information
The EKS cluster version is 1.24