doc: Improve examples for k8s readiness probes. (#1759)

This reorganizes the startup and readiness probe examples and adds explanation to them. It also updates our recommendation: use a startup probe and a liveness probe, but not a readiness probe, under most circumstances. Fixes #1757
GoogleCloudPlatform · Apr 19, 2023 · cb4f1e7
1 parent ec56eba
commit cb4f1e7

Showing 2 changed files with 149 additions and 50 deletions.
diff --git a/examples/k8s-health-check/README.md b/examples/k8s-health-check/README.md
@@ -18,17 +18,18 @@ localhost with three endpoints:
 - `/startup`: Returns 200 status when the proxy has finished starting up.
 Otherwise returns 503 status.
 
-- `/readiness`: Returns 200 status when the proxy has started, has available
-connections if max connections have been set with the `--max-connections`
-flag, and when the proxy can connect to all registered instances. Otherwise,
-returns a 503 status. Optionally supports a min-ready query param (e.g.,
-`/readiness?min-ready=3`) where the proxy will return a 200 status if the
-proxy can connect successfully to at least min-ready number of instances. If
-min-ready exceeds the number of registered instances, returns a 400.
-
 - `/liveness`: Always returns 200 status. If this endpoint is not responding,
 the proxy is in a bad state and should be restarted.
 
+- `/readiness`: Returns 200 status when the proxy has started, has available
+  connections if max connections have been set with the `--max-connections`
+  flag, and when the proxy can connect to all registered instances. Otherwise,
+  returns a 503 status. Optionally supports a min-ready query param (e.g.,
+  `/readiness?min-ready=3`) where the proxy will return a 200 status if the
+  proxy can connect successfully to at least min-ready number of instances. If
+  min-ready exceeds the number of registered instances, returns a 400.
+
+
 To configure the address, use `--http-address`. To configure the port, use
 `--http-port`.
 
@@ -39,41 +40,41 @@ To configure the address, use `--http-address`. To configure the port, use
 # Recommended configurations for health check probes.
 # Probe parameters can be adjusted to best fit the requirements of your application.
 # For details, see https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/
-livenessProbe:
-  httpGet:
-    path: /liveness
-    port: 9090
-  # Number of seconds after the container has started before the first probe is scheduled. Defaults to 0.
-  # Not necessary when the startup probe is in use.
-  initialDelaySeconds: 0
-  # Frequency of the probe. Defaults to 10.
-  periodSeconds: 10
-  # Number of seconds after which the probe times out. Defaults to 1.
-  timeoutSeconds: 5
-  # Number of times the probe is allowed to fail before the transition from healthy to failure state.
-  # Defaults to 3.
-  failureThreshold: 1
-readinessProbe:
-  httpGet:
-    path: /readiness
-    port: 9090
-  initialDelaySeconds: 0
-  periodSeconds: 10
-  timeoutSeconds: 5
-  # Number of times the probe must report success to transition from failure to healthy state.
-  # Defaults to 1 for readiness probe.
-  successThreshold: 1
-  failureThreshold: 1
 startupProbe:
-  httpGet:
-    path: /startup
-    port: 9090
-  periodSeconds: 1
-  timeoutSeconds: 5
-  failureThreshold: 20
+   # We recommend adding a startup probe to the proxy sidecar
+   # container. This will ensure that service traffic will be routed to
+   # the pod only after the proxy has successfully started.
+   httpGet:
+      path: /startup
+      port: 9090
+   periodSeconds: 1
+   timeoutSeconds: 5
+   failureThreshold: 20
+livenessProbe:
+   # We recommend adding a liveness probe to the proxy sidecar container.
+   httpGet:
+      path: /liveness
+      port: 9090
+   # Number of seconds after the container has started before the first probe is scheduled. Defaults to 0.
+   # Not necessary when the startup probe is in use.
+   initialDelaySeconds: 0
+   # Frequency of the probe.
+   periodSeconds: 60
+   # Number of seconds after which the probe times out.
+   timeoutSeconds: 30
+   # Number of times the probe is allowed to fail before the transition
+   # from healthy to failure state.
+   #
+   # If periodSeconds = 60, 5 tries will result in five minutes of
+   # checks. The proxy starts to refresh a certificate five minutes
+   # before its expiration. If those five minutes lapse without a
+   # successful refresh, the liveness probe will fail and the pod will be
+   # restarted.
+   failureThreshold: 5
+# We do not recommend adding a readiness probe under most circumstances
 ```
 
-2. Add `-use_http_health_check` and `-health-check-port` (optional) to your
+2. Add `--http-address` and `--http-port` (optional) to your
    proxy container configuration under `command: `.
     > [proxy_with_http_health_check.yaml](proxy_with_http_health_check.yaml#L53-L76)
 
@@ -103,3 +104,83 @@ args:
   - "--port=<DB_PORT>"
   - "<INSTANCE_CONNECTION_NAME>"
 ```
+
+### Readiness Health Check Configuration
+
+For most common usage, adding a readiness health check to the proxy sidecar
+container is unnecessary. An improperly configured readiness check can degrade
+the application's availability.
+
+The proxy readiness probe fails when (1) the proxy has used all its available
+concurrent connections to a database, (2) the network connection
+to the database is interrupted, or (3) the database server is unavailable due
+to a maintenance operation. These are transient states that usually resolve
+within a few seconds.
+
+Most applications are resilient to transient database connection failures and
+do not need to be restarted. We recommend adding a readiness check to the
+application container instead of the proxy container. The application can be
+programmed to report whether it is ready to receive requests, and the health
+check can be tuned to restart the pod when the application is permanently stuck.
+
+You should use the proxy container's readiness probe when these circumstances
+should cause k8s to terminate the entire pod:
+
+- The proxy can't connect to the database instances.
+- The maximum number of connections is in use.
+
+When you do use the proxy container's readiness probe, be sure to set
+`failureThreshold` and `periodSeconds` high enough to avoid restarting the pod
+on frequent transient failures.
+
+### Readiness Health Check Examples
+
+The DBA team performs database failover drills without notice. A
+batch job should fail if it cannot connect to the database for 3 minutes.
+Set the readiness check so that the pod will be terminated after 3 minutes
+of consecutive readiness check failures (6 failed readiness checks taken every
+30 seconds: 6 x 30 sec = 3 minutes).
+
+```yaml
+readinessProbe:
+  httpGet:
+    path: /readiness
+    port: 9090
+  initialDelaySeconds: 30
+  # 30 sec period x 6 failures = 3 min until the pod is terminated
+  periodSeconds: 30
+  failureThreshold: 6
+  timeoutSeconds: 10
+  successThreshold: 1
+```
+
+
+A web application has a database connection pool leak and the 
+engineering team can't find the root cause. To keep the system running, 
+the application should be automatically restarted if it consumes 50 connections 
+for more than 1 minute.
+
+```yaml
+    containers:
+    - name: my-application
+      image: gcr.io/my-container/my-application:1.1
+    - name: cloud-sql-proxy
+      image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.1.0
+      args:
+        # Set the --max-connections flag to 50
+        - "--max-connections"
+        - "50"
+        - "--port=<DB_PORT>"
+        - "<INSTANCE_CONNECTION_NAME>"
+# ...
+      readinessProbe:
+        httpGet:
+          path: /readiness
+          port: 9090
+        initialDelaySeconds: 10
+        # 5 sec period x 12 failures = 60 sec until the pod is terminated
+        periodSeconds: 5
+        failureThreshold: 12
+        timeoutSeconds: 5
+        successThreshold: 1
+```
diff --git a/examples/k8s-health-check/proxy_with_http_health_check.yaml b/examples/k8s-health-check/proxy_with_http_health_check.yaml
@@ -99,7 +99,26 @@ spec:
         # Recommended configurations for health check probes.
         # Probe parameters can be adjusted to best fit the requirements of your application.
         # For details, see https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/
+        startupProbe:
+          # The /startup probe returns OK when the proxy is ready to receive
+          # connections from the application. In this example, k8s will check
+          # once a second for 20 seconds.
+          #
+          # We strongly recommend adding a startup probe to the proxy sidecar
+          # container. This will ensure that service traffic will be routed to
+          # the pod only after the proxy has successfully started.
+          httpGet:
+            path: /startup
+            port: 9090
+          periodSeconds: 1
+          timeoutSeconds: 5
+          failureThreshold: 20
         livenessProbe:
+          # The /liveness probe returns OK as soon as the proxy application has
+          # begun its startup process and continues to return OK until the
+          # process stops.
+          #
+          # We recommend adding a liveness probe to the proxy sidecar container.
           httpGet:
             path: /liveness
             port: 9090
@@ -120,23 +139,22 @@ spec:
           # restarted.
           failureThreshold: 5
         readinessProbe:
+          # The /readiness probe returns OK when the proxy can establish
+          # new connections to its databases.
+          #
+          # Please use the readiness probe on the proxy sidecar with caution.
+          # An improperly configured readiness probe can cause unnecessary
+          # interruption to the application. See README.md for more detail.
           httpGet:
             path: /readiness
             port: 9090
-          initialDelaySeconds: 0
+          initialDelaySeconds: 10
           periodSeconds: 10
-          timeoutSeconds: 5
+          timeoutSeconds: 10
           # Number of times the probe must report success to transition from failure to healthy state.
           # Defaults to 1 for readiness probe.
           successThreshold: 1
-          failureThreshold: 1
-        startupProbe:
-          httpGet:
-            path: /startup
-            port: 9090
-          periodSeconds: 1
-          timeoutSeconds: 5
-          failureThreshold: 20
+          failureThreshold: 6
       volumes:
       - name: <YOUR-SA-SECRET-VOLUME>
         secret:
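
Note on the probe timing used throughout this change: the comments in both files rely on the same arithmetic, where the time until a probe is considered failed is roughly `periodSeconds` multiplied by `failureThreshold`. The Python sketch below checks the figures quoted in the diff under that assumption. It deliberately ignores `initialDelaySeconds` and `timeoutSeconds`, so real-world timing may differ somewhat. The `Probe` class and its method are hypothetical helpers for illustration only; they are not part of the proxy or of Kubernetes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    """Hypothetical helper holding the two probe settings that drive failure timing."""

    period_seconds: int
    failure_threshold: int

    def failure_window_seconds(self) -> int:
        """Approximate seconds of consecutive failures before the probe is marked failed.

        Simplification (assumption): initialDelaySeconds and timeoutSeconds are ignored.
        """
        return self.period_seconds * self.failure_threshold


# Values copied from the examples in this change.
startup = Probe(period_seconds=1, failure_threshold=20)
liveness = Probe(period_seconds=60, failure_threshold=5)
batch_readiness = Probe(period_seconds=30, failure_threshold=6)
pool_leak_readiness = Probe(period_seconds=5, failure_threshold=12)

for name, probe in [
    ("startup", startup),
    ("liveness", liveness),
    ("batch job readiness", batch_readiness),
    ("pool leak readiness", pool_leak_readiness),
]:
    print(f"{name}: about {probe.failure_window_seconds()} s of consecutive failures")
```

Under this approximation the liveness settings give the five-minute window that the comment ties to certificate refresh, and the two readiness examples give the 3-minute and 60-second windows described in the README.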