Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Many Redis requests are timing out and throwing errors #3621

Closed
Tracked by #3704
nathanmarcos opened this issue Aug 21, 2023 · 4 comments · Fixed by #3817
Closed
Tracked by #3704

Many Redis requests are timing out and throwing errors #3621

nathanmarcos opened this issue Aug 21, 2023 · 4 comments · Fixed by #3817
Assignees

Comments

@nathanmarcos
Copy link

Describe the bug
After fixing the latest Redis outage (#3509) on version 1.27.0, the application started throwing millions of mget error: Timeout Error: Request timed out.. It seems that this is caused when small changes in latency between Router and Redis requests last longer than the default timeout of 1ms.

To Reproduce
Steps to reproduce the behavior:

  1. Run Redis instance
  2. Run Router with APQ set up with Redis
  3. Submit request with persistedQuery hash in the extensions
  4. See errors on router's terminal:
2023-08-21T10:32:44.655911Z  ERROR [trace_id=84abf715700bdc7accbbe5bf7cac366a] mget error: Timeout Error: Request timed out.
2023-08-21T10:32:44.676982Z  ERROR [trace_id=c18fb3e62220025ab1858ee3295446d3] mget error: Timeout Error: Request timed out.
2023-08-21T10:32:44.678350Z  ERROR [trace_id=776705431a03a9d0adda5ac8eb32380c] mget error: Timeout Error: Request timed out.

Expected behavior
Errors should not happen.

@o0Ignition0o
Copy link
Contributor

It looks like the timeout is set to 1ms and not configurable, @Geal do you have any opinions on that?

                default_command_timeout_ms: 1,

@Geal
Copy link
Contributor

Geal commented Aug 22, 2023

That default option is too optimistic. Pergaps that should be configurable too, because it depends a lot on the production environment

@o0Ignition0o
Copy link
Contributor

An other thing that eludes me at the moment is @nathanmarcos didn't see these errors before, is it because redis disconnected and didn't try to reconnect?

@nathanmarcos
Copy link
Author

@o0Ignition0o, I think that’s not related since these errors occur when Redis is up and running. When the Router loses its connection to Redis, other errors in the logs become more frequent.

Example when you simply kill Redis when Router is running:

2023-08-22T17:31:36.846418Z  ERROR Client disconnected with error: Redis Error - kind: Canceled, details: Canceled.
2023-08-22T17:31:36.851909Z  ERROR Client disconnected with error: Redis Error - kind: IO, details: Os { code: 61, kind: ConnectionRefused, message: "Connection refused" }

Then when you spin it up again:

2023-08-22T17:33:10.182223Z  INFO Redis client reconnected.

I only copied one of each type of these logs, but actually, there were dozens of them.

@bnjjj bnjjj self-assigned this Sep 13, 2023
bnjjj added a commit that referenced this issue Sep 18, 2023
It adds a configuration to set another timeout than the default one
(2ms) for redis requests. It also change the default timeout to 2ms
(previously set to 1ms)

Example for APQ:
```yaml
apq:
  router:
    cache:
      redis:
        urls: ["redis://..."] 
        timeout: 5ms
```

Fixes #3621

<!-- start metadata -->
---

**Checklist**

Complete the checklist (and note appropriate exceptions) before the PR
is marked ready-for-review.

- [x] Changes are compatible[^1]
- [ ] Documentation[^2] completed
- [ ] Performance impact assessed and acceptable
- Tests added and passing[^3]
    - [ ] Unit Tests
    - [ ] Integration Tests
    - [x] Manual Tests

**Exceptions**

*Note any exceptions here*

**Notes**

[^1]: It may be appropriate to bring upcoming changes to the attention
of other (impacted) groups. Please endeavour to do this before seeking
PR approval. The mechanism for doing this will vary considerably, so use
your judgement as to how and when to do this.
[^2]: Configuration is an important part of many changes. Where
applicable please try to document configuration examples.
[^3]: Tick whichever testing boxes are applicable. If you are adding
Manual Tests, please document the manual testing (extensively) in the
Exceptions.

---------

Signed-off-by: Benjamin Coenen <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants