
Add retries to NNCP deployment #853

Merged: 1 commit merged into openstack-k8s-operators:main on Jun 20, 2024

Conversation

lewisdenny (Contributor) commented Jun 14, 2024

Sometimes applying the NNCP CR fails because one of the kubernetes-nmstate probes fails, usually the api-server probe.

Two ideas were considered to fix the issue:

  1. Extend the probe timeout
  2. Delete and reapply the NNCP CR

Results:

  1. Sadly, the probe timeout is hard-coded [1] [2]
  2. This is the path I chose: deleting the NNCP/NNCE resources and reapplying them recovers from the issue and gives the OCP environment time to settle (sketched below).

[1] nmstate/kubernetes-nmstate#831
[2] https://github.com/nmstate/kubernetes-nmstate/blob/272b29459ca99a9de62af24f509afe9f3deb058f/pkg/probe/probes.go#L59-L71

JIRA: https://issues.redhat.com/browse/OSPRH-7605
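
For reference, the delete-and-reapply retry amounts to a bounded loop around the existing make targets. A minimal sketch, assuming the nncp and nncp_cleanup targets shown in the diff excerpt further down; the retry variables, echo message, and exit codes are illustrative and not the exact change in this PR:

    #!/bin/bash
    # Illustrative wrapper only; the actual PR change may be structured differently.
    RETRIES=5
    for attempt in $(seq 1 "${RETRIES}"); do
        # "make nncp" applies the NNCP CR and waits up to NNCP_TIMEOUT
        # (240s per the review discussion) for it to be reconciled.
        make nncp && exit 0
        echo "NNCP apply attempt ${attempt}/${RETRIES} failed; cleaning up and retrying"
        # Delete the NNCP/NNCE resources so the next apply starts from a clean state.
        make nncp_cleanup
    done
    exit 1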

openshift-ci bot commented Jun 14, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/cc15ac0dc5fa4477ae03b86426dbcbf8

✔️ openstack-k8s-operators-content-provider SUCCESS in 45m 51s
install-yamls-crc-podified-edpm-baremetal FAILURE in 19m 54s
podified-multinode-edpm-deployment-crc FAILURE in 21m 21s
cifmw-data-plane-adoption-osp-17-to-extracted-crc RETRY_LIMIT in 28m 05s
cifmw-data-plane-adoption-osp-17-to-extracted-crc-minimal-no-ceph RETRY_LIMIT in 29m 09s


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/e8039ea769ce4835aac615ced8f4491a

✔️ openstack-k8s-operators-content-provider SUCCESS in 2h 53m 35s
install-yamls-crc-podified-edpm-baremetal FAILURE in 1h 27m 33s
podified-multinode-edpm-deployment-crc FAILURE in 1h 38m 13s
✔️ cifmw-data-plane-adoption-osp-17-to-extracted-crc SUCCESS in 2h 27m 55s
✔️ cifmw-data-plane-adoption-osp-17-to-extracted-crc-minimal-no-ceph SUCCESS in 2h 36m 12s

lewisdenny (Contributor, Author) commented:

recheck
No pod logs were captured to see what's happening in the ansibleee job :/

@lewisdenny lewisdenny marked this pull request as ready for review June 17, 2024 22:58
@openshift-ci openshift-ci bot requested review from cjeanner and stuggi June 17, 2024 22:58

while true; do
    make nncp && break
    make nncp_cleanup
done
Contributor commented:

I feel like this is not a stable pattern.

When we delete and recreate the resource, even if it has the same content, it becomes a separate object at the k8s level. Without an exponential backoff, i.e. waiting longer each time we call make nncp, I think it's possible we could end up in a loop just because we kept restarting before it could make progress.

Should we increase the NNCP_TIMEOUT slightly on each retry?
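
Purely for illustration, the backoff suggested here could look like the sketch below; the sleep, the doubling, and the NNCP_TIMEOUT command-line override are assumptions and not part of this PR (the author declines the idea in the reply that follows):

    #!/bin/bash
    RETRIES=5
    DELAY=30      # initial wait between attempts (illustrative)
    TIMEOUT=240   # starting per-attempt timeout in seconds (matches the 240s discussed below)
    for attempt in $(seq 1 "${RETRIES}"); do
        # Assumes NNCP_TIMEOUT can be overridden on the make command line; the value format is a guess.
        make nncp NNCP_TIMEOUT="${TIMEOUT}s" && exit 0
        make nncp_cleanup
        echo "Attempt ${attempt} failed; backing off ${DELAY}s before retrying"
        sleep "${DELAY}"
        DELAY=$((DELAY * 2))        # exponential backoff between attempts
        TIMEOUT=$((TIMEOUT + 60))   # grow the per-attempt timeout slightly each retry
    done
    exit 1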

lewisdenny (Contributor, Author) commented Jun 19, 2024

"I feel like this is not a stable pattern."

It's not ideal, but it's a worst-case recovery; we already give the NNCP CR 240 seconds to complete and I haven't seen any jobs failing due to timeout.

"I think it's possible we could end up in a loop just because we kept restarting before it could make progress."

I haven't seen any job fail due to the NNCP CR taking over 240s to apply. This loop is limited to 5 iterations, so the longest it could be stuck is 20 minutes, and that would only happen if there is a real issue with the NNCP CR that needs to be fixed, i.e. a real bug.

"Should we increase the NNCP_TIMEOUT slightly on each retry?"

No, I don't believe we should. I believe it's already high enough, unless you have seen jobs fail to apply the NNCP within 240 seconds?


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/429261c567a74daa88e729d1ee8e15fa

openstack-k8s-operators-content-provider POST_FAILURE in 3h 03m 41s
✔️ install-yamls-crc-podified-edpm-baremetal SUCCESS in 1h 16m 50s
✔️ podified-multinode-edpm-deployment-crc SUCCESS in 1h 19m 26s
✔️ cifmw-data-plane-adoption-osp-17-to-extracted-crc SUCCESS in 2h 47m 24s
✔️ cifmw-data-plane-adoption-osp-17-to-extracted-crc-minimal-no-ceph SUCCESS in 2h 44m 11s

lewisdenny (Contributor, Author) commented:

recheck

Unrelated failure in the post job; collecting logs failed:

kex_exchange_identification: read: Connection reset by peer

rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
rsync error: unexplained error (code 255) at io.c(226) [Receiver=3.1.3]

bshephar (Contributor) left a comment:

If it helps bring some additional stability to CI jobs, then lgtm.

ciecierski (Contributor) left a comment:

I had a problem with NNCP on my local environment and this PR fixed it, so I'm giving it an LGTM. However, I would also be interested in why we need these oc apply retries in the first place.

openshift-ci bot commented Jun 20, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bshephar, ciecierski, fao89, lewisdenny

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit 3c1d04d into openstack-k8s-operators:main Jun 20, 2024
5 checks passed