Gateway Health Check Endpoint stays healthy when Redis is not connected #2248

Closed
dijitali opened this issue May 2, 2019 · 2 comments

dijitali commented May 2, 2019

Do you want to request a feature or report a bug?

Bug:

We have two Tyk Gateway instances running in Kubernetes configured to use a Redis cluster for storage, similar to the example:

  "storage": {
    "type": "redis",
    "host": "localhost",
    "port": 6379,
    "username": "",
    "password": "",
    "database": 0,
    "optimisation_max_idle": 2000,
    "optimisation_max_active": 4000
  },

What is the current behavior?

On cluster restart, one of the Tyk Gateway pods failed to successfully reconnect to Redis. It attempted to reconnect twice and then appeared to stop retrying (NB: IP addresses replaced with dummy entries):

time="02:44:41" level=error msg="Redis disconnected or error received, attempting to reconnect: read tcp 1.2.3.4:44424->5.6.7.8:6379: read: connection reset by peer" 
time="02:44:41" level=error msg="Redis disconnected or error received, attempting to reconnect: read tcp 1.2.3.4:44424->5.6.7.8:6379: read: connection reset by peer" 
time="02:44:41" level=error msg="Connection to Redis failed, reconnect in 10s" err="read tcp 1.2.3.4:44424->5.6.7.8:6379: read: connection reset by peer" 
time="02:44:51" level=warning msg=Reconnecting 
time="02:44:54" level=error msg="Connection to Redis failed, reconnect in 10s" err="dial tcp 9.10.11.12:6379: connect: no route to host" 
time="02:45:04" level=warning msg=Reconnecting 

After this point, requests to that Tyk Gateway instance failed to authenticate (presumably because the Gateway was disconnected from Redis and so didn't recognise the key):

time="14:35:21" level=warning msg="Key not found in storage engine" err="key not found" inbound-key="********"
time="14:35:21" level=info msg="Attempted access with non-existent key."

However, the /hello Gateway health check endpoint continued to return HTTP 200 responses. Since we use this endpoint for Readiness and Liveness probes in Kubernetes, the instance stayed in service and kept receiving traffic, but returned HTTP 403 to every request:

< HTTP/1.1 403 Forbidden
< Content-Type: application/json
< Date: xyz
< Content-Length: 57
<
{
    "error": "Access to this API has been disallowed"
}

What is the expected behavior?

Health Check Endpoint Should Fail

If the Tyk Gateway is configured to use Redis storage but cannot connect to the Redis cluster, it should not return an HTTP 200 response on the health check endpoint.

We can then use higher-level orchestration to determine that something is wrong and take action (take the instance out of service, restart it and/or trigger an alert).

Gateway should continue to reconnect or terminate

If the Tyk Gateway is configured to use Redis storage, it should continue trying to reconnect more than twice (potentially with exponential backoff). If it is going to stop attempting to reconnect, it should terminate itself.

If the current behavior is a bug, please provide the steps to reproduce and if possible a minimal demo of the problem

  1. Configure a Tyk Gateway instance to use a Redis cluster
  2. Configure an API to use Access Tokens
  3. Make the Redis cluster unavailable
  4. Observe that the Tyk Gateway loses connection and attempts to reconnect twice at 10s intervals
  5. Observe that the health check endpoint continues to return HTTP 200 responses

Which versions of Tyk are affected by this issue? Did this work in previous versions of Tyk?

Observed in Tyk v2.7.4. Not yet tested on latest version (there appear to be some redis-related improvements in v2.7.7).

adelowo (Contributor) commented May 2, 2019

There's a PR for liveness probe at #2180

dijitali (Author) commented May 8, 2019

@adelowo awesome!

Duplicate of #857
