
Configure dogpile.cache to deal with memcached pods failures #904

Open
wants to merge 1 commit into base: main

Conversation


@lmiccini lmiccini commented Nov 29, 2024

Whenever one of the memcached pods disappears, because of a rolling restart during a minor update or as a result of a failure, the APIs can take a long time to detect that the pod went away and keep trying to reconnect.

From a quick round of tests we saw downtimes up to ~150s.

By enabling the retry_client and limiting the number of retries, the behavior becomes much more acceptable.

Similarly, when TLS is not in use, we may want to set a lower value for memcache_dead_retry so that we eventually reconnect to a new pod (which has the same DNS name but a different IP) much faster.

Related: OSPRH-11935 (https://issues.redhat.com/browse/OSPRH-11935)
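For context, a minimal sketch of what the resulting [cache] section in nova.conf could look like with these options when TLS is in use (memcache_servers is shown as a placeholder; the retry values match the ones in the diff below):

[cache]
enabled = True
backend = dogpile.cache.pymemcache
memcache_servers = <memcached endpoints>   # placeholder, normally rendered by the operator template
# Give up on a dead pod quickly instead of blocking API workers:
enable_retry_client = true
retry_attempts = 2
retry_delay = 0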

@openshift-ci openshift-ci bot requested review from kk7ds and mrkisaolamb November 29, 2024 12:39

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/64e175abee114fb3bf273a352d0e3b48

✔️ openstack-meta-content-provider SUCCESS in 2h 51m 58s
✔️ nova-operator-kuttl SUCCESS in 46m 07s
✔️ nova-operator-tempest-multinode SUCCESS in 2h 36m 05s
❌ nova-operator-tempest-multinode-ceph FAILURE in 1h 25m 26s

@@ -172,8 +172,12 @@ enabled = True
 # on contoler we prefer to use memcache when its deployed
 {{if .MemcachedTLS}}
 backend = dogpile.cache.pymemcache
+enable_retry_client = true
+retry_attempts = 2
+retry_delay = 0
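
The hunk above only touches the TLS (pymemcache) branch. For the non-TLS case the description suggests lowering memcache_dead_retry; a hypothetical sketch of how that could look in the same template, assuming an {{else}} branch carries the non-TLS backend settings (30 is purely an illustrative value, not one taken from this PR):

{{else}}
# ... existing non-TLS backend settings ...
# Retry a "dead" server sooner, so clients pick up the new pod IP behind the same DNS name:
memcache_dead_retry = 30
{{end}}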
Contributor

The issue reproduces with the server list as well; it does take a lot of time, even several seconds after the memcached pod starts.
But there was no visible change for me after applying the suggested change in the configs.

Author

Thanks Amit. You need to patch the keystone config as well, otherwise you'll still have keystone waiting for the memcache pod to come back (see openstack-k8s-operators/keystone-operator#511).
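
For reference, a hypothetical sketch of the matching keystone-side change, assuming keystone's [cache] section accepts the same oslo.cache options (the real change lives in keystone-operator#511; this is not a quote of that patch):

[cache]
backend = dogpile.cache.pymemcache
# Same retry client settings as on the nova side, so keystone also stops
# waiting indefinitely on a dead memcached pod:
enable_retry_client = true
retry_attempts = 2
retry_delay = 0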

Contributor

The other thing to consider is that we use memcache to cache the server metadata, which is expensive to compute. Restarting memcached like this effectively clears the cache, so new requests will miss the cache and need to hit the DB directly; that's fine, but it could cause things to time out.

This is mostly mitigated because we are using config drive by default, and therefore instance boot has no direct dependency on the metadata API, but it is just something to be aware of.

Any tempest tests that try to hit the metadata API for an instance that was created before the minor update will take slightly longer than normal to receive the response because it's not coming from the cache.


openshift-ci bot commented Dec 17, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: lmiccini
Once this PR has been reviewed and has the lgtm label, please assign dprince for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/8c191d1966ed48289fdca1312404556c

✔️ openstack-meta-content-provider SUCCESS in 3h 33m 00s
nova-operator-kuttl RETRY_LIMIT in 18m 31s
❌ nova-operator-tempest-multinode FAILURE in 22m 04s
✔️ nova-operator-tempest-multinode-ceph SUCCESS in 2h 51m 16s
