
AWX jobs can't tolerate the EKS scale out #14293

Open
elibogomolnyi opened this issue Jul 28, 2023 · 1 comment · Fixed by ansible/receptor#818
Labels: community, component:api, component:awx_collection, type:bug


elibogomolnyi commented Jul 28, 2023

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I might not receive a timely response.
  • I am NOT reporting a (potential) security vulnerability. (These should be emailed to [email protected] instead.)

Bug Summary

We have a performance-test environment where we try to run more than 400 heavy workflows simultaneously. This causes the EKS control plane (master nodes) to scale out, and about half of our pods fail with errors. We already opened an issue about this when we reproduced K8s master node termination in a kOps cluster (since we cannot do that with EKS), and it was fixed for kOps clusters here, thanks to the community and especially to @TheRealHaoLiu. We also verified the fix with AWS support: we asked them to terminate K8s nodes manually and confirmed that the issue was resolved. The problem is that the fix does not fully cover the scale-in or scale-out process of the EKS control plane.

In our case, jobs fail on EKS when the control plane scales out. For now, we are running the control-plane-ee with the following configuration:

  ee_extra_env: |-
    - name: RECEPTOR_KUBE_SUPPORT_RECONNECT
      value: enabled
    - name: RECEPTOR_LOG_LEVEL
      value: debug
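
For reference, a minimal sketch of where this override lives in the AWX custom resource we deploy via the AWX Operator (the resource name and namespace below are placeholders, not our actual values):

  apiVersion: awx.ansible.com/v1beta1
  kind: AWX
  metadata:
    name: awx            # placeholder resource name
    namespace: awx       # placeholder namespace
  spec:
    ee_extra_env: |-
      - name: RECEPTOR_KUBE_SUPPORT_RECONNECT
        value: enabled
      - name: RECEPTOR_LOG_LEVEL
        value: debug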

With the bugfix merged, some pods can now tolerate the scale-out and finish running successfully, which we consider an intermediate result. We see the following logs for them (the second error happens occasionally and doesn't affect AWX functionality):

automation-job-141264-fxwvc (finished successfully)

INFO 2023/07/27 20:17:38 [OqRdMEsY] Detected http2.GoAwayError for pod awx-workers/automation-job-141264-fxwvc. Will retry 5 more times. Error: http2: server sent GOAWAY and closed the connection; LastStreamID=6491, ErrCode=NO_ERROR, debug=""
ERROR 2023/07/27 20:26:47 [OqRdMEsY] Error reading from pod awx-workers/automation-job-141264-fxwvc: context canceled

But there are still many pods that fail with errors. Here are the variations of logs that we see (we show all logs in control-plane-ee that are related to the particular job):

automation-job-143331-7lggn (failed)
ERROR 2023/07/27 20:19:56 [JwdC5LdB] Error reading from pod awx-workers/automation-job-143331-7lggn: unexpected EOF

automation-job-140003-gq82x (failed)

INFO 2023/07/27 20:17:04 [xJDCo0r6] Detected http2.GoAwayError for pod awx-workers/automation-job-140003-gq82x. Will retry 5 more times. Error: http2: server sent GOAWAY and closed the connection; LastStreamID=3639, ErrCode=NO_ERROR, debug=""
ERROR 2023/07/27 20:19:56 [xJDCo0r6] Error reading from pod awx-workers/automation-job-140003-gq82x: unexpected EOF

automation-job-142724-mhdcm (failed)
ERROR 2023/07/27 20:17:04 [7Ewis9Pm] Error reading from pod awx-workers/automation-job-142724-mhdcm: read tcp 100.64.85.69:40134->172.27.2.235:443: read: connection reset by peer

We still have our performance environment up and running and will gladly provide more logs if needed. Please tell us whether changing the log level from DEBUG to TRACE makes sense and could give us more information about the issue.

AWX version

22.5.0

Select the relevant components

  • UI
  • UI (tech preview)
  • API
  • Docs
  • Collection
  • CLI
  • Other

Installation method

kubernetes

Modifications

no

Ansible version

No response

Operating system

No response

Web browser

No response

Steps to reproduce

Run more than 400 heavy workflows in an EKS cluster, enough to cause the EKS control plane to scale out.

Expected results

Jobs can tolerate the EKS control plane scale-out.

Actual results

About half of the jobs fail with errors.

Additional information

The EKS cluster version is 1.24

@github-actions github-actions bot added the component:api, component:awx_collection, needs_triage, type:bug, and community labels on Jul 28, 2023
@elibogomolnyi elibogomolnyi changed the title from "AWX jobs still can't tolerate the EKS scale out" to "AWX jobs can't tolerate the EKS scale out" on Jul 28, 2023
@TheRealHaoLiu
Member

previous related issue #13350
