
Support configuring readiness probe timeout #197

Open
jsenko opened this issue Mar 14, 2023 · 2 comments


@jsenko
Member

jsenko commented Mar 14, 2023

See https://docs.openshift.com/container-platform/4.12/applications/application-health.html

This helps with an edge case in kafkasql storage, where a huge number of artifacts in the topic causes the pod to take too long to become ready, resulting in a restart loop.
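For context, the time Kubernetes allows a container before restarting it is roughly `initialDelaySeconds + failureThreshold × periodSeconds` per probe. A sketch of what configurable probes could look like on the registry container follows; the field names are standard Kubernetes, but the concrete values and the `/health/*` paths are illustrative assumptions, not operator defaults:

```yaml
# Illustrative Kubernetes probe settings (values are assumptions, not defaults).
# Startup time budget before restart: 30 + 60 * 10 = 630 seconds.
startupProbe:
  httpGet:
    path: /health/live    # assumed liveness endpoint
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 60
readinessProbe:
  httpGet:
    path: /health/ready   # assumed readiness endpoint
    port: 8080
  timeoutSeconds: 5       # the timeout this issue asks to make configurable
```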

@martin-aders

martin-aders commented Jul 11, 2023

edit 2: Found the root cause of the slow startup: every time the registry restarted, an empty message was written to the Kafka topic. Combined with a memory leak that caused the registry to restart frequently (every ~15 min), these empty messages piled up to >20k messages on the topic, all of which were read, discarded, and logged as tombstone messages during startup. To fix this, we backed up the schemas, set the retention time so that all old messages were cleaned up, then imported the schemas again. The referenced blog post did not help in our case, as the steps it describes did not delete the old messages. Startup time on the cleaned-up kafkasql topic dropped from 139s to 5s. -> Suggest to prefer cleaning up the topic before sacrificing readiness probe timeouts.
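The cleanup described above could be sketched like this. This is one possible sequence under several assumptions: the registry's v2 admin API is reachable at `localhost:8080`, the kafkasql topic uses the default name `kafkasql-journal`, the Kafka CLI tools are on `PATH`, and the topic is compacted (the usual setup), so retention only takes effect after temporarily switching `cleanup.policy`:

```shell
# 1. Back up all schemas via the registry's admin export API
#    (v2 API path; adjust host/path for your deployment).
curl -o registry-backup.zip http://localhost:8080/apis/registry/v2/admin/export

# 2. Temporarily allow the broker to delete everything on the topic.
#    kafkasql topics are typically compacted, so switch to a delete policy
#    and set a very short retention time.
kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name kafkasql-journal \
  --add-config cleanup.policy=delete,retention.ms=1000

# 3. Give the log cleaner time to purge old segments (timing varies with
#    broker settings such as segment.ms), then restore the original config.
sleep 120
kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name kafkasql-journal \
  --delete-config cleanup.policy,retention.ms

# 4. Re-import the schemas into the now-empty registry.
curl -X POST -H "Content-Type: application/zip" \
  --data-binary @registry-backup.zip \
  http://localhost:8080/apis/registry/v2/admin/import
```

Run this against a live broker and registry only; step 2 makes the topic's data deletable, so verify the export in step 1 is complete before proceeding.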

edit 1: the slow startup time and tombstone messages seem to be related to an older bug that got fixed but requires manual cleanup. Instead of raising startup probe timeouts, consider fixing the root cause instead (see https://www.apicur.io/blog/2021/12/09/kafkasql-storage-and-security).

Confirming the same issue with slow initial startup of the registry (as of commit d490f6e / 1.1.0-dev). Since the operator does not reconcile changes to the probes of its Deployments, a workaround is to patch the Deployment after the operator has created it. Note that this change will likely be overridden after a config change on your ApicurioRegistry resources:

```shell
kubectl patch --namespace apicurio Deployment/apicurio-deployment --patch '{"spec":{"template":{"spec":{"containers":[{"name": "apicurio", "startupProbe":{"initialDelaySeconds":60,"failureThreshold": 90,"periodSeconds": 10,"successThreshold": 1,"timeoutSeconds": 5,"httpGet":{"path": "/health/live","port": 8080,"scheme": "HTTP"}}}]}}}}'
```

In my case, the registry took up to 30 min to start up (4 restart attempts) and issued thousands of messages like this one:

```
(KSQL Kafka Consumer Thread) Discarded a (presumed) tombstone message with key: ArtifactVersionKey
```
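Before raising probe timeouts, it may help to gauge how many records a startup replay will have to read. One way, assuming the default `kafkasql-journal` topic name and the Kafka CLI tools on `PATH`, is to print the end offset of each partition; for a bloated topic, the sum approximates the number of records replayed at startup:

```shell
# Print <topic>:<partition>:<end offset> for each partition of the kafkasql
# topic (--time -1 requests the latest offsets).
kafka-run-class.sh kafka.tools.GetOffsetShell \
  --broker-list localhost:9092 \
  --topic kafkasql-journal \
  --time -1
```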

I would prefer to fix the root cause of the slow startup if possible, even though configurable probes would be nice too (along with resource request/limit configuration). Does anyone know how this slow startup can happen at all, and how to avoid it?

@carlesarnal
Member

We have good news to share here. We have started implementing a snapshotting mechanism that will reduce the startup time significantly. This new feature will be available with Apicurio Registry 3.0.
