
Support configuring readiness probe timeout #197

Open
jsenko opened this issue Mar 14, 2023 · 2 comments


@jsenko
Member

jsenko commented Mar 14, 2023

See https://docs.openshift.com/container-platform/4.12/applications/application-health.html

This helps with an edge case in kafkasql storage, where a huge number of artifacts in the topic causes the pod to take too long to become ready, resulting in a restart loop.
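For context, the time Kubernetes allows a container before restarting it is roughly `initialDelaySeconds + failureThreshold × periodSeconds` per probe. A sketch of what configurable probes could look like on the registry container follows; the field names are standard Kubernetes, but the concrete values and the `/health/*` paths are illustrative assumptions, not operator defaults:

```yaml
# Illustrative Kubernetes probe settings (values are assumptions, not defaults).
# Startup time budget before restart: 30 + 60 * 10 = 630 seconds.
startupProbe:
  httpGet:
    path: /health/live    # assumed liveness endpoint
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 60
readinessProbe:
  httpGet:
    path: /health/ready   # assumed readiness endpoint
    port: 8080
  timeoutSeconds: 5       # the timeout this issue asks to make configurable
```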

@martin-aders

martin-aders commented Jul 11, 2023

edit 2: Found the root cause of the slow startup: every time the registry restarted, an empty message was written to the Kafka topic. Combined with a memory leak that caused the registry to restart frequently (every ~15 min), these empty messages piled up to >20k messages on the topic, all of which were read, discarded, and logged as tombstone messages during startup. To fix this, we backed up the schemas, set the retention time so that all old messages were cleaned up, then imported the schemas again. The referenced blog post did not help in our case, as the steps it describes did not delete the old messages. Startup time on the cleaned-up kafkasql topic dropped from 139s to 5s. -> Suggest to prefer cleaning up the topic before sacrificing readiness probe timeouts.
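The cleanup described above could be sketched like this. This is one possible sequence under several assumptions: the registry's v2 admin API is reachable at `localhost:8080`, the kafkasql topic uses the default name `kafkasql-journal`, the Kafka CLI tools are on `PATH`, and the topic is compacted (the usual setup), so retention only takes effect after temporarily switching `cleanup.policy`:

```shell
# 1. Back up all schemas via the registry's admin export API
#    (v2 API path; adjust host/path for your deployment).
curl -o registry-backup.zip http://localhost:8080/apis/registry/v2/admin/export

# 2. Temporarily allow the broker to delete everything on the topic.
#    kafkasql topics are typically compacted, so switch to a delete policy
#    and set a very short retention time.
kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name kafkasql-journal \
  --add-config cleanup.policy=delete,retention.ms=1000

# 3. Give the log cleaner time to purge old segments (timing varies with
#    broker settings such as segment.ms), then restore the original config.
sleep 120
kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name kafkasql-journal \
  --delete-config cleanup.policy,retention.ms

# 4. Re-import the schemas into the now-empty registry.
curl -X POST -H "Content-Type: application/zip" \
  --data-binary @registry-backup.zip \
  http://localhost:8080/apis/registry/v2/admin/import
```

Run this against a live broker and registry only; step 2 makes the topic's data deletable, so verify the export in step 1 is complete before proceeding.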

edit 1: the slow startup time and tombstone messages seem to be related to an older bug that got fixed but requires manual cleanup. Instead of raising startup probe timeouts, consider fixing the root cause instead (see https://www.apicur.io/blog/2021/12/09/kafkasql-storage-and-security).

Confirming the same issue with slow initial startup of the registry (as of commit d490f6e / 1.1.0-dev). Since the operator does not reconcile changes to the probes of its Deployments, a workaround is to patch the Deployment after the operator has created it. Note that this change will likely be overridden after a config change on your ApicurioRegistry resources:

```shell
kubectl patch --namespace apicurio Deployment/apicurio-deployment --patch '{"spec":{"template":{"spec":{"containers":[{"name": "apicurio", "startupProbe":{"initialDelaySeconds":60,"failureThreshold": 90,"periodSeconds": 10,"successThreshold": 1,"timeoutSeconds": 5,"httpGet":{"path": "/health/live","port": 8080,"scheme": "HTTP"}}}]}}}}'
```

In my case, the registry took up to 30 min to start up (4 restart attempts) and issued thousands of messages like this one:

```
(KSQL Kafka Consumer Thread) Discarded a (presumed) tombstone message with key: ArtifactVersionKey
```
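Before raising probe timeouts, it may help to gauge how many records a startup replay will have to read. One way, assuming the default `kafkasql-journal` topic name and the Kafka CLI tools on `PATH`, is to print the end offset of each partition; for a bloated topic, the sum approximates the number of records replayed at startup:

```shell
# Print <topic>:<partition>:<end offset> for each partition of the kafkasql
# topic (--time -1 requests the latest offsets).
kafka-run-class.sh kafka.tools.GetOffsetShell \
  --broker-list localhost:9092 \
  --topic kafkasql-journal \
  --time -1
```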

I would prefer to fix the root cause of the slow startup if possible, even though configurable probes would be nice too (along with resource request/limit configuration). Does anyone know how this slow startup can happen at all, and how to avoid it?

@carlesarnal
Member

We have good news to share here. We have started implementing a snapshotting mechanism that will reduce the startup time significantly. This new feature will be available with Apicurio Registry 3.0.
