/alive and /ready health checks do not consider subsystem health #1865

Closed
dimabarbul opened this issue Jan 17, 2024 Discussed in #1861 · 1 comment · Fixed by #1867
@dimabarbul (Contributor)

Discussed in #1861

Originally posted by dimabarbul January 12, 2024
We deploy Ditto (tried 3.4.0 and 3.4.4) using the Helm chart. Sometimes, for example during a cluster upgrade, we face the following issue.

The things service loses its connection to MongoDB (this is expected, as MongoDB might be temporarily unavailable during a cluster upgrade), but when MongoDB is back online later, the things service does not reconnect. In the logs we can see errors such as `com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches ReadPreferenceServerSelector{readPreference=primary}. Client view of cluster state is {type=REPLICA_SET, srvResolutionException=com.mongodb.MongoConfigurationException: Failed looking up SRV record for '_mongodb._tcp.default-psmdb-db-rs0.databases.svc.cluster.local'.`. Nevertheless, the /alive and /ready health check endpoints return OK, while a request to http://ditto-gateway:8080/health returns 503 as expected. We waited for more than an hour and the issue did not go away.

The only place we found where this error is visible (apart from the logs) is a request to http://ditto-things:8080/status/health. The problem with this approach is that port 8080 is not declared in the Helm chart and is therefore not reachable by Kubernetes health probes.

Restarting the pod fixes the issue, but the restart has to be done manually because Kubernetes does not detect any problem.

Logs from the things service that might be related:

```json
{"@timestamp":"2024-01-11T12:52:00.028+01:00","@Version":"1","message":"Stream failed","logger_name":"org.eclipse.ditto.internal.utils.persistentactors.cleanup.PersistenceCleanupActor","thread_name":"ditto-cluster-pekko.actor.default-dispatcher-14","level":"ERROR","level_value":40000,"stack_trace":"java.util.concurrent.CompletionException: com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches ReadPreferenceServerSelector{readPreference=primary}. Client view of cluster state is {type=REPLICA_SET, srvResolutionException=com.mongodb.MongoConfigurationException: Failed looking up SRV record for '_mongodb._tcp.default-psmdb-db-rs0.databases.svc.cluster.local'., servers=[]\n\tat java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:332)\n\tat java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:347)\n\tat java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:636)\n\tat java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)\n\tat java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162)\n\tat scala.concurrent.impl.FutureConvertersImpl$CF.apply(FutureConvertersImpl.scala:26)\n\tat scala.concurrent.impl.FutureConvertersImpl$CF.apply(FutureConvertersImpl.scala:23)\n\tat scala.concurrent.impl.Promise$Transformation.run(Promise.scala:484)\n\tat scala.concurrent.ExecutionContext$parasitic$.execute(ExecutionContext.scala:222)\n\tat scala.concurrent.impl.Promise$Transformation.submitWithValue(Promise.scala:429)\n\tat scala.concurrent.impl.Promise$DefaultPromise.submitWithValue(Promise.scala:338)\n\tat scala.concurrent.impl.Promise$DefaultPromise.tryComplete0(Promise.scala:285)\n\tat scala.concurrent.impl.Promise$Transformation.run(Promise.scala:504)\n\tat scala.concurrent.ExecutionContext$parasitic$.execute(ExecutionContext.scala:222)\n\tat scala.concurrent.impl.Promise$Transformation.submitWithValue(Promise.scala:429)\n\tat scala.concurrent.impl.Promise$DefaultPromise.submitWithValue(Promise.scala:338)\n\tat scala.concurrent.impl.Promise$DefaultPromise.tryComplete0(Promise.scala:285)\n\tat scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:278)\n\tat scala.concurrent.Promise.tryFailure(Promise.scala:117)\n\tat scala.concurrent.Promise.tryFailure$(Promise.scala:117)\n\tat scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:104)\n\tat org.apache.pekko.stream.impl.SeqStage$$anon$3.onUpstreamFailure(Sinks.scala:274)\n\tat org.apache.pekko.stream.impl.fusing.GraphInterpreter.processEvent(GraphInterpreter.scala:537)\n\tat org.apache.pekko.stream.impl.fusing.GraphInterpreter.execute(GraphInterpreter.scala:400)\n\tat org.apache.pekko.stream.impl.fusing.GraphInterpreterShell.runBatch(ActorGraphInterpreter.scala:662)\n\tat org.apache.pekko.stream.impl.fusing.ActorGraphInterpreter$SimpleBoundaryEvent.execute(ActorGraphInterpreter.scala:71)\n\tat org.apache.pekko.stream.impl.fusing.ActorGraphInterpreter$SimpleBoundaryEvent.execute$(ActorGraphInterpreter.scala:67)\n\tat org.apache.pekko.stream.impl.fusing.ActorGraphInterpreter$BatchingActorInputBoundary$OnError.execute(ActorGraphInterpreter.scala:92)\n\tat org.apache.pekko.stream.impl.fusing.GraphInterpreterShell.processEvent(ActorGraphInterpreter.scala:637)\n\tat org.apache.pekko.stream.impl.fusing.ActorGraphInterpreter.org$apache$pekko$stream$impl$fusing$ActorGraphInterpreter$$processEvent(ActorGraphInterpreter.scala:813)\n\tat org.apache.pekko.stream.impl.fusing.ActorGraphInterpreter$$anonfun$receive$1.applyOrElse(ActorGraphInterpreter.scala:831)\n\tat org.apache.pekko.actor.Actor.aroundReceive(Actor.scala:547)\n\tat org.apache.pekko.actor.Actor.aroundReceive$(Actor.scala:545)\n\tat org.apache.pekko.stream.impl.fusing.ActorGraphInterpreter.aroundReceive(ActorGraphInterpreter.scala:729)\n\tat org.apache.pekko.actor.ActorCell.receiveMessage(ActorCell.scala:590)\n\tat org.apache.pekko.actor.ActorCell.invoke(ActorCell.scala:557)\n\tat org.apache.pekko.dispatch.Mailbox.processMailbox(Mailbox.scala:280)\n\tat org.apache.pekko.dispatch.Mailbox.run(Mailbox.scala:241)\n\tat kamon.instrumentation.executor.ExecutorInstrumentation$InstrumentedForkJoinPool$TimingRunnable.run(ExecutorInstrumentation.scala:698)\n\tat org.apache.pekko.dispatch.ForkJoinExecutorConfigurator$PekkoForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:57)\n\tat java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373)\n\tat java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)\n\tat java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)\n\tat java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)\n\tat java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165)\nCaused by: com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches ReadPreferenceServerSelector{readPreference=primary}. Client view of cluster state is {type=REPLICA_SET, srvResolutionException=com.mongodb.MongoConfigurationException: Failed looking up SRV record for '_mongodb._tcp.default-psmdb-db-rs0.databases.svc.cluster.local'., servers=[]\n\tat com.mongodb.internal.connection.BaseCluster.createTimeoutException(BaseCluster.java:430)\n\tat com.mongodb.internal.connection.BaseCluster.handleServerSelectionRequest(BaseCluster.java:325)\n\tat com.mongodb.internal.connection.BaseCluster.access$700(BaseCluster.java:69)\n\tat com.mongodb.internal.connection.BaseCluster$WaitQueueHandler.run(BaseCluster.java:505)\n\tat java.base/java.lang.Thread.run(Thread.java:840)\n","sourceThread":"Thread-37","pekkoAddress":"pekko://[email protected]:2551","pekkoSource":"pekko://ditto-cluster/user/thingsRoot/persistenceCleanup"}
{"@timestamp":"2024-01-11T12:52:00.029+01:00","@Version":"1","message":"Stream failed or shutdown. Next stream in starting from <>","logger_name":"org.eclipse.ditto.internal.utils.persistentactors.cleanup.PersistenceCleanupActor","thread_name":"ditto-cluster-pekko.actor.default-dispatcher-14","level":"INFO","level_value":20000,"sourceThread":"ditto-cluster-pekko.actor.default-dispatcher-11","pekkoAddress":"pekko://[email protected]:2551","pekkoSource":"pekko://ditto-cluster/user/thingsRoot/persistenceCleanup"}
{"@timestamp":"2024-01-11T12:52:07.879+01:00","@Version":"1","message":"Sending the health of this system as requested: StatusInfo [status=DOWN, details=[StatusDetailMessage [level=ERROR, message=\"See detailed messages for: persistence.\"]], children=[StatusInfo [status=DOWN, details=[StatusDetailMessage [level=ERROR, message=\"com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches WritableServerSelector. Client view of cluster state is {type=REPLICA_SET, srvResolutionException=com.mongodb.MongoConfigurationException: Failed looking up SRV record for '_mongodb._tcp.default-psmdb-db-rs0.databases.svc.cluster.local'., servers=[]\"]], children=[], label=persistence]], label=null]","logger_name":"org.eclipse.ditto.internal.utils.health.status.StatusSupplierActor","thread_name":"ditto-cluster-pekko.actor.default-dispatcher-6","level":"INFO","level_value":20000,"sourceThread":"Thread-38","pekkoAddress":"pekko://[email protected]:2551","pekkoSource":"pekko://ditto-cluster/user/statusSupplier"}
{"@timestamp":"2024-01-11T12:52:12.009+01:00","@Version":"1","message":"Timed out after 30000 ms while waiting for a server that matches WritableServerSelector. Client view of cluster state is {type=REPLICA_SET, srvResolutionException=com.mongodb.MongoConfigurationException: Failed looking up SRV record for '_mongodb._tcp.default-psmdb-db-rs0.databases.svc.cluster.local'., servers=[]","logger_name":"org.eclipse.ditto.internal.utils.persistence.mongo.MongoHealthChecker","thread_name":"ditto-cluster-pekko.actor.default-dispatcher-11","level":"ERROR","level_value":40000,"stack_trace":"com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches WritableServerSelector. Client view of cluster state is {type=REPLICA_SET, srvResolutionException=com.mongodb.MongoConfigurationException: Failed looking up SRV record for '_mongodb._tcp.default-psmdb-db-rs0.databases.svc.cluster.local'., servers=[]\n\tat com.mongodb.internal.connection.BaseCluster.createTimeoutException(BaseCluster.java:430)\n\tat com.mongodb.internal.connection.BaseCluster.handleServerSelectionRequest(BaseCluster.java:325)\n\tat com.mongodb.internal.connection.BaseCluster.access$700(BaseCluster.java:69)\n\tat com.mongodb.internal.connection.BaseCluster$WaitQueueHandler.run(BaseCluster.java:505)\n\tat java.base/java.lang.Thread.run(Thread.java:840)\n","sourceThread":"Thread-39","pekkoAddress":"pekko://[email protected]:2551","pekkoSource":"pekko://ditto-cluster/user/thingsRoot/healthCheckingActor/persistence"}
```

As discussed, it would be good to create a class that checks subsystem health and can be used as a Pekko Management health check.
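
To illustrate the idea, here is a rough sketch of what such a check could look like. Pekko Management expects a health check to be a `java.util.function.Supplier<CompletionStage<Boolean>>` with a constructor taking the `ActorSystem`; everything else below (the actor path, the `RetrieveHealth`/`RetrieveHealthResponse` messages, and how the reply is evaluated) is an assumption for illustration, not the actual Ditto API:

```java
import java.time.Duration;
import java.util.concurrent.CompletionStage;
import java.util.function.Supplier;

import org.apache.pekko.actor.ActorSystem;
import org.apache.pekko.pattern.Patterns;

// Sketch only. The Supplier<CompletionStage<Boolean>> contract and the ActorSystem
// constructor are what Pekko Management requires; the actor path, the messages and
// the response handling below are hypothetical placeholders.
public final class SubsystemHealthCheck implements Supplier<CompletionStage<Boolean>> {

    // assumed path of the service's health-checking actor (cf. "pekkoSource" in the logs above)
    private static final String HEALTH_ACTOR_PATH = "/user/thingsRoot/healthCheckingActor";
    private static final Duration TIMEOUT = Duration.ofSeconds(10);

    private final ActorSystem system;

    public SubsystemHealthCheck(final ActorSystem system) {
        this.system = system;
    }

    @Override
    public CompletionStage<Boolean> get() {
        // Ask the health-checking actor for the aggregated subsystem status;
        // RetrieveHealth/RetrieveHealthResponse stand in for whatever message and
        // reply the health actor actually uses.
        return Patterns.ask(system.actorSelection(HEALTH_ACTOR_PATH), RetrieveHealth.newInstance(), TIMEOUT)
                .thenApply(answer -> answer instanceof RetrieveHealthResponse
                        && ((RetrieveHealthResponse) answer).getStatusInfo().isHealthy())
                .exceptionally(error -> false); // a timeout or ask failure counts as "not healthy"
    }
}
```

Whether such a check should feed only the readiness check (so a MongoDB outage makes the pod unready) or the liveness check as well (forcing a restart) is exactly the open question below.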

@thjaeckle I'm not sure whether the class should be included in the liveness (or readiness) checks by default, or whether this possibility should just be documented. Could you please advise?

@thjaeckle (Member)

@dimabarbul Yes, I would include the subsystem health checks by default.
If they are not wanted, the configuration can be overridden to exclude them.
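
For reference, a sketch of how registering (and, by overriding, excluding) such checks via the Pekko Management health-check configuration might look. The check name `subsystem-health` and the class name are carried over from the hypothetical sketch above and are not actual Ditto names:

```hocon
# Assumed keys of Pekko Management's health-check configuration;
# the check name and class are hypothetical.
pekko.management.health-checks {
  readiness-checks {
    subsystem-health = "org.eclipse.ditto.internal.utils.health.SubsystemHealthCheck"
  }
  liveness-checks {
    subsystem-health = "org.eclipse.ditto.internal.utils.health.SubsystemHealthCheck"
  }
}
```

Overriding or emptying these entries in the service configuration would then be the way to opt out, as suggested above.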

thjaeckle added the bug label Jan 18, 2024
thjaeckle added this to the 3.5.0 milestone Jan 19, 2024
thjaeckle moved this to Waiting for Approval in Ditto Planning Jan 23, 2024
github-project-automation bot moved this from Waiting for Approval to Done in Ditto Planning Jan 23, 2024