/alive and /ready health checks do not consider subsystem health #1865

Closed
dimabarbul opened this issue Jan 17, 2024 Discussed in #1861 · 1 comment · Fixed by #1867
@dimabarbul (Contributor)

Discussed in #1861

Originally posted by dimabarbul January 12, 2024
We deploy Ditto (tried 3.4.0 and 3.4.4) using the Helm chart. Sometimes, for example during a cluster upgrade, we face the following issue.

The things service loses its connection to MongoDB (this is expected, as MongoDB might be temporarily unavailable during a cluster upgrade), but when MongoDB is back online later, the things service does not reconnect. In the logs we can see errors such as `com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches ReadPreferenceServerSelector{readPreference=primary}. Client view of cluster state is {type=REPLICA_SET, srvResolutionException=com.mongodb.MongoConfigurationException: Failed looking up SRV record for '_mongodb._tcp.default-psmdb-db-rs0.databases.svc.cluster.local'.`. Nevertheless, the /alive and /ready health check endpoints return OK, while a request to http://ditto-gateway:8080/health returns 503 as expected. We waited for more than an hour and the issue did not go away.

The only place we found where this error is visible (apart from the logs) is a request to http://ditto-things:8080/status/health. The problem with this approach is that port 8080 is not declared in the Helm chart and is therefore not reachable by Kubernetes health probes.

Restarting the pod fixes the issue, but the restart has to be done manually because Kubernetes does not detect any problem.

Logs from the things service that might be related:

```json
{"@timestamp":"2024-01-11T12:52:00.028+01:00","@Version":"1","message":"Stream failed","logger_name":"org.eclipse.ditto.internal.utils.persistentactors.cleanup.PersistenceCleanupActor","thread_name":"ditto-cluster-pekko.actor.default-dispatcher-14","level":"ERROR","level_value":40000,"stack_trace":"java.util.concurrent.CompletionException: com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches ReadPreferenceServerSelector{readPreference=primary}. Client view of cluster state is {type=REPLICA_SET, srvResolutionException=com.mongodb.MongoConfigurationException: Failed looking up SRV record for '_mongodb._tcp.default-psmdb-db-rs0.databases.svc.cluster.local'., servers=[]\n\tat java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:332)\n\tat java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:347)\n\tat java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:636)\n\tat java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)\n\tat java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162)\n\tat scala.concurrent.impl.FutureConvertersImpl$CF.apply(FutureConvertersImpl.scala:26)\n\tat scala.concurrent.impl.FutureConvertersImpl$CF.apply(FutureConvertersImpl.scala:23)\n\tat scala.concurrent.impl.Promise$Transformation.run(Promise.scala:484)\n\tat scala.concurrent.ExecutionContext$parasitic$.execute(ExecutionContext.scala:222)\n\tat scala.concurrent.impl.Promise$Transformation.submitWithValue(Promise.scala:429)\n\tat scala.concurrent.impl.Promise$DefaultPromise.submitWithValue(Promise.scala:338)\n\tat scala.concurrent.impl.Promise$DefaultPromise.tryComplete0(Promise.scala:285)\n\tat scala.concurrent.impl.Promise$Transformation.run(Promise.scala:504)\n\tat scala.concurrent.ExecutionContext$parasitic$.execute(ExecutionContext.scala:222)\n\tat scala.concurrent.impl.Promise$Transformation.submitWithValue(Promise.scala:429)\n\tat scala.concurrent.impl.Promise$DefaultPromise.submitWithValue(Promise.scala:338)\n\tat scala.concurrent.impl.Promise$DefaultPromise.tryComplete0(Promise.scala:285)\n\tat scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:278)\n\tat scala.concurrent.Promise.tryFailure(Promise.scala:117)\n\tat scala.concurrent.Promise.tryFailure$(Promise.scala:117)\n\tat scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:104)\n\tat org.apache.pekko.stream.impl.SeqStage$$anon$3.onUpstreamFailure(Sinks.scala:274)\n\tat org.apache.pekko.stream.impl.fusing.GraphInterpreter.processEvent(GraphInterpreter.scala:537)\n\tat org.apache.pekko.stream.impl.fusing.GraphInterpreter.execute(GraphInterpreter.scala:400)\n\tat org.apache.pekko.stream.impl.fusing.GraphInterpreterShell.runBatch(ActorGraphInterpreter.scala:662)\n\tat org.apache.pekko.stream.impl.fusing.ActorGraphInterpreter$SimpleBoundaryEvent.execute(ActorGraphInterpreter.scala:71)\n\tat org.apache.pekko.stream.impl.fusing.ActorGraphInterpreter$SimpleBoundaryEvent.execute$(ActorGraphInterpreter.scala:67)\n\tat org.apache.pekko.stream.impl.fusing.ActorGraphInterpreter$BatchingActorInputBoundary$OnError.execute(ActorGraphInterpreter.scala:92)\n\tat org.apache.pekko.stream.impl.fusing.GraphInterpreterShell.processEvent(ActorGraphInterpreter.scala:637)\n\tat org.apache.pekko.stream.impl.fusing.ActorGraphInterpreter.org$apache$pekko$stream$impl$fusing$ActorGraphInterpreter$$processEvent(ActorGraphInterpreter.scala:813)\n\tat org.apache.pekko.stream.impl.fusing.ActorGraphInterpreter$$anonfun$receive$1.applyOrElse(ActorGraphInterpreter.scala:831)\n\tat org.apache.pekko.actor.Actor.aroundReceive(Actor.scala:547)\n\tat org.apache.pekko.actor.Actor.aroundReceive$(Actor.scala:545)\n\tat org.apache.pekko.stream.impl.fusing.ActorGraphInterpreter.aroundReceive(ActorGraphInterpreter.scala:729)\n\tat org.apache.pekko.actor.ActorCell.receiveMessage(ActorCell.scala:590)\n\tat org.apache.pekko.actor.ActorCell.invoke(ActorCell.scala:557)\n\tat org.apache.pekko.dispatch.Mailbox.processMailbox(Mailbox.scala:280)\n\tat org.apache.pekko.dispatch.Mailbox.run(Mailbox.scala:241)\n\tat kamon.instrumentation.executor.ExecutorInstrumentation$InstrumentedForkJoinPool$TimingRunnable.run(ExecutorInstrumentation.scala:698)\n\tat org.apache.pekko.dispatch.ForkJoinExecutorConfigurator$PekkoForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:57)\n\tat java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373)\n\tat java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)\n\tat java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)\n\tat java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)\n\tat java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165)\nCaused by: com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches ReadPreferenceServerSelector{readPreference=primary}. Client view of cluster state is {type=REPLICA_SET, srvResolutionException=com.mongodb.MongoConfigurationException: Failed looking up SRV record for '_mongodb._tcp.default-psmdb-db-rs0.databases.svc.cluster.local'., servers=[]\n\tat com.mongodb.internal.connection.BaseCluster.createTimeoutException(BaseCluster.java:430)\n\tat com.mongodb.internal.connection.BaseCluster.handleServerSelectionRequest(BaseCluster.java:325)\n\tat com.mongodb.internal.connection.BaseCluster.access$700(BaseCluster.java:69)\n\tat com.mongodb.internal.connection.BaseCluster$WaitQueueHandler.run(BaseCluster.java:505)\n\tat java.base/java.lang.Thread.run(Thread.java:840)\n","sourceThread":"Thread-37","pekkoAddress":"pekko://[email protected]:2551","pekkoSource":"pekko://ditto-cluster/user/thingsRoot/persistenceCleanup"}
{"@timestamp":"2024-01-11T12:52:00.029+01:00","@Version":"1","message":"Stream failed or shutdown. Next stream in starting from <>","logger_name":"org.eclipse.ditto.internal.utils.persistentactors.cleanup.PersistenceCleanupActor","thread_name":"ditto-cluster-pekko.actor.default-dispatcher-14","level":"INFO","level_value":20000,"sourceThread":"ditto-cluster-pekko.actor.default-dispatcher-11","pekkoAddress":"pekko://[email protected]:2551","pekkoSource":"pekko://ditto-cluster/user/thingsRoot/persistenceCleanup"}
{"@timestamp":"2024-01-11T12:52:07.879+01:00","@Version":"1","message":"Sending the health of this system as requested: StatusInfo [status=DOWN, details=[StatusDetailMessage [level=ERROR, message=\"See detailed messages for: persistence.\"]], children=[StatusInfo [status=DOWN, details=[StatusDetailMessage [level=ERROR, message=\"com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches WritableServerSelector. Client view of cluster state is {type=REPLICA_SET, srvResolutionException=com.mongodb.MongoConfigurationException: Failed looking up SRV record for '_mongodb._tcp.default-psmdb-db-rs0.databases.svc.cluster.local'., servers=[]\"]], children=[], label=persistence]], label=null]","logger_name":"org.eclipse.ditto.internal.utils.health.status.StatusSupplierActor","thread_name":"ditto-cluster-pekko.actor.default-dispatcher-6","level":"INFO","level_value":20000,"sourceThread":"Thread-38","pekkoAddress":"pekko://[email protected]:2551","pekkoSource":"pekko://ditto-cluster/user/statusSupplier"}
{"@timestamp":"2024-01-11T12:52:12.009+01:00","@Version":"1","message":"Timed out after 30000 ms while waiting for a server that matches WritableServerSelector. Client view of cluster state is {type=REPLICA_SET, srvResolutionException=com.mongodb.MongoConfigurationException: Failed looking up SRV record for '_mongodb._tcp.default-psmdb-db-rs0.databases.svc.cluster.local'., servers=[]","logger_name":"org.eclipse.ditto.internal.utils.persistence.mongo.MongoHealthChecker","thread_name":"ditto-cluster-pekko.actor.default-dispatcher-11","level":"ERROR","level_value":40000,"stack_trace":"com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches WritableServerSelector. Client view of cluster state is {type=REPLICA_SET, srvResolutionException=com.mongodb.MongoConfigurationException: Failed looking up SRV record for '_mongodb._tcp.default-psmdb-db-rs0.databases.svc.cluster.local'., servers=[]\n\tat com.mongodb.internal.connection.BaseCluster.createTimeoutException(BaseCluster.java:430)\n\tat com.mongodb.internal.connection.BaseCluster.handleServerSelectionRequest(BaseCluster.java:325)\n\tat com.mongodb.internal.connection.BaseCluster.access$700(BaseCluster.java:69)\n\tat com.mongodb.internal.connection.BaseCluster$WaitQueueHandler.run(BaseCluster.java:505)\n\tat java.base/java.lang.Thread.run(Thread.java:840)\n","sourceThread":"Thread-39","pekkoAddress":"pekko://[email protected]:2551","pekkoSource":"pekko://ditto-cluster/user/thingsRoot/healthCheckingActor/persistence"}
```

As discussed, it would be good to create a class that checks subsystem health and can be used as a Pekko Management health check.
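
To illustrate the idea, here is a rough sketch of what such a check could look like. Pekko Management expects a health check to be a `java.util.function.Supplier<CompletionStage<Boolean>>` with a constructor taking the `ActorSystem`; everything else below (the actor path, the `RetrieveHealth`/`RetrieveHealthResponse` messages, and how the reply is evaluated) is an assumption for illustration, not the actual Ditto API:

```java
import java.time.Duration;
import java.util.concurrent.CompletionStage;
import java.util.function.Supplier;

import org.apache.pekko.actor.ActorSystem;
import org.apache.pekko.pattern.Patterns;

// Sketch only. The Supplier<CompletionStage<Boolean>> contract and the ActorSystem
// constructor are what Pekko Management requires; the actor path, the messages and
// the response handling below are hypothetical placeholders.
public final class SubsystemHealthCheck implements Supplier<CompletionStage<Boolean>> {

    // assumed path of the service's health-checking actor (cf. "pekkoSource" in the logs above)
    private static final String HEALTH_ACTOR_PATH = "/user/thingsRoot/healthCheckingActor";
    private static final Duration TIMEOUT = Duration.ofSeconds(10);

    private final ActorSystem system;

    public SubsystemHealthCheck(final ActorSystem system) {
        this.system = system;
    }

    @Override
    public CompletionStage<Boolean> get() {
        // Ask the health-checking actor for the aggregated subsystem status;
        // RetrieveHealth/RetrieveHealthResponse stand in for whatever message and
        // reply the health actor actually uses.
        return Patterns.ask(system.actorSelection(HEALTH_ACTOR_PATH), RetrieveHealth.newInstance(), TIMEOUT)
                .thenApply(answer -> answer instanceof RetrieveHealthResponse
                        && ((RetrieveHealthResponse) answer).getStatusInfo().isHealthy())
                .exceptionally(error -> false); // a timeout or ask failure counts as "not healthy"
    }
}
```

Whether such a check should feed only the readiness check (so a MongoDB outage makes the pod unready) or the liveness check as well (forcing a restart) is exactly the open question below.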

@thjaeckle I'm not sure whether the class should be included in the liveness (or readiness) checks by default, or whether this possibility should just be documented. Could you please advise?

@thjaeckle (Member)

@dimabarbul Yes, I would include the subsystem health checks by default.
If they are not wanted, the configuration can be overridden to exclude them.
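
For reference, a sketch of how registering (and, by overriding, excluding) such checks via the Pekko Management health-check configuration might look. The check name `subsystem-health` and the class name are carried over from the hypothetical sketch above and are not actual Ditto names:

```hocon
# Assumed keys of Pekko Management's health-check configuration;
# the check name and class are hypothetical.
pekko.management.health-checks {
  readiness-checks {
    subsystem-health = "org.eclipse.ditto.internal.utils.health.SubsystemHealthCheck"
  }
  liveness-checks {
    subsystem-health = "org.eclipse.ditto.internal.utils.health.SubsystemHealthCheck"
  }
}
```

Overriding or emptying these entries in the service configuration would then be the way to opt out, as suggested above.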

thjaeckle added the bug label Jan 18, 2024
thjaeckle added this to the 3.5.0 milestone Jan 19, 2024
thjaeckle moved this to Waiting for Approval in Ditto Planning Jan 23, 2024
github-project-automation bot moved this from Waiting for Approval to Done in Ditto Planning Jan 23, 2024