Gateway Health Check Endpoint stays healthy when Redis is not connected #2248

dijitali · 2019-05-02T11:29:22Z

Do you want to request a feature or report a bug?

Bug:

We have two Tyk Gateway instances running in Kubernetes configured to use a Redis cluster for storage, similar to the example:

  "storage": {
    "type": "redis",
    "host": "localhost",
    "port": 6379,
    "username": "",
    "password": "",
    "database": 0,
    "optimisation_max_idle": 2000,
    "optimisation_max_active": 4000
  },

What is the current behavior?

On cluster restart, one of the Tyk Gateway pods failed to successfully reconnect to Redis. It attempted to reconnect twice and then appeared to stop retrying (NB: IP addresses replaced with dummy entries):

time="02:44:41" level=error msg="Redis disconnected or error received, attempting to reconnect: read tcp 1.2.3.4:44424->5.6.7.8:6379: read: connection reset by peer" 
time="02:44:41" level=error msg="Redis disconnected or error received, attempting to reconnect: read tcp 1.2.3.4:44424->5.6.7.8:6379: read: connection reset by peer" 
time="02:44:41" level=error msg="Connection to Redis failed, reconnect in 10s" err="read tcp 1.2.3.4:44424->5.6.7.8:6379: read: connection reset by peer" 
time="02:44:51" level=warning msg=Reconnecting 
time="02:44:54" level=error msg="Connection to Redis failed, reconnect in 10s" err="dial tcp 9.10.11.12:6379: connect: no route to host" 
time="02:45:04" level=warning msg=Reconnecting

After this point, requests to that Tyk Gateway instance failed to authenticate (presumably because the Gateway was disconnected from Redis and so didn't recognise the key):

time="14:35:21" level=warning msg="Key not found in storage engine" err="key not found" inbound-key="********"
time="14:35:21" level=info msg="Attempted access with non-existent key."

However the /hello Gateway health check endpoint continued to return HTTP 200 responses. Since we use this endpoint for Readiness and Liveness probes in Kubernetes, the instance continued to take traffic but returned HTTP 403 to all requests:

< HTTP/1.1 403 Forbidden
< Content-Type: application/json
< Date: xyz
< Content-Length: 57
<
{
    "error": "Access to this API has been disallowed"
}

What is the expected behavior?

Health Check Endpoint Should Fail

If the Tyk Gateway is configured to use Redis storage but cannot connect to the Redis cluster, it should not return an HTTP 200 response on the health check endpoint.

We can then use higher-level orchestration to determine that something is wrong and take action (take the instance out of service, restart it and/or trigger an alert).
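To illustrate the behaviour being requested, here is a minimal sketch of a health handler that reports failure when the storage backend does not answer a ping. It is not Tyk's implementation: the `Pinger` interface, `healthHandler`, `stubPinger` and `probeStatus` are hypothetical names invented for this example, and a real handler would ping the actual Redis client.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// Pinger is a hypothetical abstraction over the storage client.
// It is an assumption for this sketch, not part of Tyk's API.
type Pinger interface {
	Ping(ctx context.Context) error
}

// healthHandler returns 200 only when the storage backend answers a ping
// within the timeout, and 503 otherwise.
func healthHandler(store Pinger, timeout time.Duration) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), timeout)
		defer cancel()
		if err := store.Ping(ctx); err != nil {
			http.Error(w, "storage unavailable: "+err.Error(), http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
		fmt.Fprintln(w, "ok")
	}
}

// stubPinger stands in for a real Redis client in this sketch.
type stubPinger struct{ err error }

func (s stubPinger) Ping(context.Context) error { return s.err }

// errDown simulates the dial error seen in the logs above.
var errDown = errors.New("no route to host")

// probeStatus issues one in-memory GET /hello against the handler and
// returns the status code a probe would see.
func probeStatus(store Pinger) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/hello", nil)
	healthHandler(store, time.Second)(rec, req)
	return rec.Code
}

func main() {
	fmt.Println("redis up:  ", probeStatus(stubPinger{}))
	fmt.Println("redis down:", probeStatus(stubPinger{err: errDown}))
}
```

With a handler along these lines, a Kubernetes readiness probe pointed at the endpoint would take the pod out of rotation as soon as storage became unreachable, instead of leaving it to serve 403s.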

Gateway should continue to reconnect or terminate

If the Tyk Gateway is configured to use Redis storage, it should continue trying to reconnect more than twice (potentially with exponential backoff). If it is going to stop attempting to reconnect, it should terminate itself.
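A rough sketch of what "keep retrying with exponential backoff, then terminate" could look like is below. It is only an illustration under stated assumptions: `backoffDelays`, `reconnect` and `flakyDial` are hypothetical helpers, the delays are arbitrary, and the `dial` callback stands in for whatever the Gateway actually uses to connect to Redis.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"time"
)

// backoffDelays returns the wait before each retry: base, 2*base, 4*base, ...
// with every value capped at max.
func backoffDelays(base, max time.Duration, attempts int) []time.Duration {
	delays := make([]time.Duration, 0, attempts)
	d := base
	for i := 0; i < attempts; i++ {
		if d > max {
			d = max
		}
		delays = append(delays, d)
		d *= 2
	}
	return delays
}

// reconnect calls dial once per entry in delays until it succeeds.
// It returns the number of attempts made, and an error if all of them failed.
// sleep is injectable so the loop can be exercised without real waiting.
func reconnect(dial func() error, delays []time.Duration, sleep func(time.Duration)) (int, error) {
	if len(delays) == 0 {
		return 0, errors.New("no reconnect attempts configured")
	}
	var lastErr error
	for i, d := range delays {
		if lastErr = dial(); lastErr == nil {
			return i + 1, nil
		}
		if i < len(delays)-1 {
			sleep(d) // no point waiting after the final failed attempt
		}
	}
	return len(delays), fmt.Errorf("giving up after %d attempts: %w", len(delays), lastErr)
}

// flakyDial simulates a backend that refuses the first n connection attempts.
func flakyDial(n int) func() error {
	calls := 0
	return func() error {
		calls++
		if calls <= n {
			return errors.New("connection refused")
		}
		return nil
	}
}

// noSleep skips waiting; useful when exercising the loop in tests.
func noSleep(time.Duration) {}

func main() {
	delays := backoffDelays(10*time.Millisecond, 80*time.Millisecond, 6)
	n, err := reconnect(flakyDial(3), delays, time.Sleep)
	if err != nil {
		// Terminate rather than keep running without storage.
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("connected after", n, "attempts")
}
```

Exiting with a non-zero status once retries are exhausted lets the orchestrator restart the process, which covers the "terminate itself" half of the request.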

If the current behavior is a bug, please provide the steps to reproduce and if possible a minimal demo of the problem

1. Configure a Tyk Gateway instance to use a Redis cluster
2. Configure an API to use Access Tokens
3. Make the Redis cluster unavailable
4. Observe that the Tyk Gateway loses connection and attempts to reconnect twice at 10s intervals
5. Observe that the health check endpoint continues to return HTTP 200 responses
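The last observation can be checked with a small script. The sketch below is an assumption-laden example rather than a Tyk tool: `status`, `falselyHealthy` and `fakeGateway` are hypothetical helpers, the API path and key are placeholders, and the key is assumed to be sent in the `Authorization` header. The fake server only imitates the symptom described above so that the example runs without a Gateway.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// status performs a GET and returns only the HTTP status code.
// If key is non-empty it is sent in the Authorization header (an assumption
// about how the API under test expects its token).
func status(url, key string) (int, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return 0, err
	}
	if key != "" {
		req.Header.Set("Authorization", key)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

// falselyHealthy reports whether /hello returns 200 while a keyed request
// to apiPath is rejected with 403, i.e. the symptom from this report.
func falselyHealthy(base, apiPath, key string) (bool, error) {
	health, err := status(base+"/hello", "")
	if err != nil {
		return false, err
	}
	api, err := status(base+apiPath, key)
	if err != nil {
		return false, err
	}
	return health == http.StatusOK && api == http.StatusForbidden, nil
}

// fakeGateway imitates the reported symptom: healthy /hello, 403 elsewhere.
func fakeGateway() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/hello", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		http.Error(w, `{"error": "Access to this API has been disallowed"}`, http.StatusForbidden)
	})
	return httptest.NewServer(mux)
}

func main() {
	srv := fakeGateway()
	defer srv.Close()
	bad, err := falselyHealthy(srv.URL, "/my-api/", "dummy-key")
	if err != nil {
		fmt.Println("probe failed:", err)
		return
	}
	fmt.Println("healthy but rejecting requests:", bad)
}
```

Against a real instance, `srv.URL` would be replaced with the Gateway's address and the placeholder path and key with a real API and token.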

Which versions of Tyk affected by this issue? Did this work in previous versions of Tyk?

Observed in Tyk v2.7.4. Not yet tested on the latest version (there appear to be some Redis-related improvements in v2.7.7).

adelowo · 2019-05-02T11:43:24Z

There's a PR for liveness probe at #2180

dijitali · 2019-05-08T16:59:10Z

@adelowo awesome!

Duplicate of #857

dijitali closed this as completed May 8, 2019

vanny96 mentioned this issue Oct 28, 2024: Tyk's health check always returns 200 #6674 (Open)
