Refactor application health monitoring #82

istreeter · 2024-08-13T17:27:05Z

Application health is needed for two distinct reasons:

1. Sending alerts to a monitoring webhook for setup errors and, relatedly, sending heartbeat messages to the webhook when healthy.
2. The HTTP health probe, needed so that the orchestration environment (Kubernetes, for example) can kill the pod when it is unhealthy.

Several Snowplow apps each implement their own logic of when to toggle the health probe and when to send an alert. This PR consolidates that logic into one place.

After this PR, the application code just needs to call methods on the AppHealth class. This library then manages the webhook events and the health probe, based on the current status of the AppHealth.
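To illustrate the shape of this design, here is a minimal sketch in plain Scala. It is not the code in this PR: `SimpleAppHealth`, `reportHealthy`, `reportUnhealthy`, and `probeStatusCode` are hypothetical names, and the assumption that every service starts out unhealthy is made only for the example. The point is that the application only reports status, and the probe response is derived from the stored state.

```scala
// Hypothetical sketch only; none of these names are taken from the PR.
sealed trait ServiceStatus
case object ServiceHealthy extends ServiceStatus
final case class ServiceUnhealthy(reason: String) extends ServiceStatus

final class SimpleAppHealth(services: Set[String]) {
  // Assumption for this sketch: every service is unhealthy until reported otherwise.
  private var statuses: Map[String, ServiceStatus] =
    services.map(s => s -> (ServiceUnhealthy("not yet reported"): ServiceStatus)).toMap

  // Application code calls these two methods and nothing else.
  def reportHealthy(service: String): Unit = synchronized {
    statuses = statuses.updated(service, ServiceHealthy)
  }

  def reportUnhealthy(service: String, reason: String): Unit = synchronized {
    statuses = statuses.updated(service, ServiceUnhealthy(reason))
  }

  // The probe is a pure function of the current state:
  // 200 only when every service is healthy, otherwise 503.
  def probeStatusCode: Int = synchronized {
    if (statuses.values.forall(_ == ServiceHealthy)) 200 else 503
  }
}
```

With this shape, the decision of when the probe flips lives in one place instead of being re-implemented in each app.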


modules/runtime-common/src/main/scala/com/snowplowanalytics/snowplow/runtime/AppHealth.scala

pondzix · 2024-08-14T13:00:00Z


+
+  private[runtime] sealed trait SetupStatus[+Alert]
+  private[runtime] object SetupStatus {
+    case object AwaitingHealth extends SetupStatus[Nothing]


It only affects what is printed in the logs, right? In the health probe it still means 503 unhealthy?

When we first transition to Healthy, we send a heartbeat event to the webhook.

When we first transition to Unhealthy, we send an alert event to the webhook.

Therefore we need a type which is neither Healthy nor Unhealthy, so that we can detect the transition. Hence AwaitingHealth.
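The transition logic described above can be sketched roughly as follows. This is a simplified illustration, not the implementation in this PR: the `Status` and `WebhookEvent` types and the `eventOnTransition` function are hypothetical, the alert is reduced to a plain string, and periodic heartbeats after the first one are not modelled.

```scala
// Hypothetical, simplified model of the setup status discussed above.
sealed trait Status
case object AwaitingHealth extends Status
case object Healthy extends Status
final case class Unhealthy(alert: String) extends Status

sealed trait WebhookEvent
case object Heartbeat extends WebhookEvent
final case class Alert(message: String) extends WebhookEvent

object Transitions {
  // Returns the webhook event to send, if any, when the status changes.
  def eventOnTransition(previous: Status, current: Status): Option[WebhookEvent] =
    (previous, current) match {
      case (Healthy, Healthy)           => None               // already healthy: no transition
      case (_, Healthy)                 => Some(Heartbeat)    // first transition to Healthy
      case (Unhealthy(_), Unhealthy(_)) => None               // already unhealthy: no transition
      case (_, Unhealthy(message))      => Some(Alert(message)) // first transition to Unhealthy
      case (_, AwaitingHealth)          => None               // nothing to report yet
    }
}
```

Starting from `AwaitingHealth`, the first status reported is always a transition, so the first heartbeat or alert is sent. If the initial state were `Healthy` or `Unhealthy`, a first report with that same status would look like no change and the event would be skipped.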

modules/runtime-common/src/main/scala/com/snowplowanalytics/snowplow/runtime/HealthProbe.scala

modules/runtime-common/src/main/scala/com/snowplowanalytics/snowplow/runtime/Webhook.scala

istreeter force-pushed the feature/application-health-monitoring branch from 4b3a25f to b29df23 on August 13, 2024 17:37

[amendment] tests to validate the heartbeat schema

d8b7bb0