Add retries to NNCP deployment #853
Conversation
Skipping CI for Draft Pull Request.
Build failed (check pipeline). Post https://review.rdoproject.org/zuul/buildset/cc15ac0dc5fa4477ae03b86426dbcbf8 ✔️ openstack-k8s-operators-content-provider SUCCESS in 45m 51s
Build failed (check pipeline). Post https://review.rdoproject.org/zuul/buildset/e8039ea769ce4835aac615ced8f4491a ✔️ openstack-k8s-operators-content-provider SUCCESS in 2h 53m 35s
recheck
    while true; do
        make nncp && break
        make nncp_cleanup
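Only a fragment of the change is visible above. A fuller sketch of the bounded retry discussed in this thread might look like the following; the 5-attempt cap comes from the author's reply below, and everything else (variable names, messages) is illustrative rather than the exact code in the PR.

    # Sketch only: bounded retry around the nncp make targets.
    # The 5-attempt cap is taken from the review discussion; the rest is illustrative.
    attempts=5
    for i in $(seq 1 "${attempts}"); do
        if make nncp; then
            break                  # NNCP applied and reached its desired state
        fi
        echo "nncp attempt ${i}/${attempts} failed, cleaning up before retrying"
        make nncp_cleanup          # delete the NNCP CR so the next apply starts clean
    done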
I feel like this is not a stable pattern.
When we delete and recreate the resource, even if it has the same content, it is going to be a separate object at the k8s level. So without an exponential backoff, i.e. waiting longer each time we call make nncp, I think it's possible that we could end up in a loop just because we kept restarting before it could make progress.
Should we increase the NNCP_TIMEOUT slightly on each retry?
"I feel like this is not a stable pattern."
It's not ideal, but it's a worst-case recovery: we already give the NNCP CR 240 seconds to complete, and I haven't seen any jobs failing due to that timeout.
"I think it's possible that we could end up in a loop just because we kept restarting before it could make progress."
I haven't seen any job fail because the NNCP CR took over 240 seconds to apply. This loop is limited to 5 iterations, so the maximum it could be stuck for is 20 minutes, and that would only happen if there were a real issue with the NNCP CR that needs to be fixed, i.e. a real bug.
"Should we increase the NNCP_TIMEOUT slightly on each retry?"
No, I don't believe we should. I believe it's already high enough, unless you have seen jobs where applying the NNCP takes more than 240 seconds?
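For comparison, the timeout-growth idea floated above (and declined here) could be sketched roughly as below. Whether NNCP_TIMEOUT can be overridden on the make command line, the value format, and the doubling policy are assumptions for illustration, not part of this PR.

    # Illustrative only: grow the timeout on each retry instead of using a fixed 240s.
    # NNCP_TIMEOUT is the variable named in the thread; passing it to make and
    # doubling it per attempt are assumptions.
    timeout=240
    for i in 1 2 3 4 5; do
        if make nncp NNCP_TIMEOUT="${timeout}s"; then
            break
        fi
        make nncp_cleanup
        timeout=$((timeout * 2))   # give the NNCP CR longer to converge next time
    done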
Build failed (check pipeline). Post https://review.rdoproject.org/zuul/buildset/429261c567a74daa88e729d1ee8e15fa ❌ openstack-k8s-operators-content-provider POST_FAILURE in 3h 03m 41s
recheck
Unrelated failure in the post job: collecting logs failed.
If it helps bring some additional stability to CI jobs, then lgtm.
I had a problem with nncp on my local environment and this PR fixed it, so I'm giving it LGTM. However, I would also be interested in why we need these oc apply retries in the first place.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: bshephar, ciecierski, fao89, lewisdenny. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Merged commit 3c1d04d into openstack-k8s-operators:main
Sometimes applying the NNCP CR fails because one of the kubernetes-nmstate probes fails, usually the api-server probe.
Two ideas were considered to fix the issue:
Results:
[1] nmstate/kubernetes-nmstate#831
[2] https://github.com/nmstate/kubernetes-nmstate/blob/272b29459ca99a9de62af24f509afe9f3deb058f/pkg/probe/probes.go#L59-L71
JIRA: https://issues.redhat.com/browse/OSPRH-7605
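As a debugging aid for the probe failures described in the summary above, the policy and per-node enactment status usually show which probe is reported as failing. These commands are a generic sketch for local investigation, not part of this PR.

    # Sketch: inspect why an NNCP apply is degraded. Not part of this PR.
    oc get nncp        # overall policy status
    oc get nnce        # per-node enactments, which carry the failure details
    oc get nncp -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.conditions}{"\n"}{end}'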