ui, server: hot ranges page hits context deadline exceeded #104269

Open
zachlite opened this issue Jun 2, 2023 · 0 comments
Labels
A-check-on-console (issues that need to be checked on CC Console), C-bug (code not up to spec/doc, specs and docs deemed correct; solution expected to change code/behavior), T-observability

zachlite commented Jun 2, 2023

The Hot Ranges page can time out if the target cluster has a large number of nodes (45 nodes in the case of this reported error).

Pagination was added in af62e80, as part of #74377, to improve the performance of Hot Ranges requests.

We might consider a pagination scheme that visits one node at a time and streams each node's data back to the client. Hot Range data doesn't require any final aggregation or filtering, so we don't need to wait for a cluster-wide fan-out to complete.

Jira issue: CRDB-28439

zachlite added the C-bug and T-cluster-observability labels Jun 2, 2023
craig bot pushed a commit that referenced this issue Jul 24, 2023
107457: ui: increase hot ranges page timeout r=zachlite a=zachlite

This commit increases the hot ranges request timeout to 30 minutes for both the initial fetch and the refresh.

Informs #104269
Epic: none
Release note (bug fix): The timeout duration when loading the
Hot Ranges page has been increased to 30 minutes.

107468: kv: implement errors.Wrapper on sendError, deflake test r=knz a=nvanbenschoten

Fixes #107353.

This commit makes `sendError` implement the `errors.Wrapper` interface. This deflakes `TestDefaultConnectionDisruptionDoesNotInterfereWithSystemTraffic`, which was expecting a call to `require.ErrorIs` to find a `context.DeadlineExceeded` in an error chain that included a `sendError`.

Release note: None

Co-authored-by: zachlite <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
zachlite added a commit to zachlite/cockroach that referenced this issue Jul 28, 2023
Requests for hot ranges are serviced by a cluster-wide fan-out,
where non-trivial work is done on each node to provide a response.
For each store, and for each hot range, we start a transaction with KV to look
up descriptor info.

Previously, there was no upper bound on the time a node could take
to provide a response. This commit introduces a per-node timeout
in the pagination logic, configurable with the new cluster setting
server.hot_ranges.node.timeout. A value of 0 disables the timeout.

Error behavior and semantics are preserved. If a particular node times out,
the fan-out continues as before, as if that node had failed to provide a response.

Informs cockroachdb#104269
Resolves cockroachdb#107627
Epic: none
Release note (ops change): Added a new cluster setting named
server.hot_ranges.node.timeout, with a default value of 5 minutes.
The setting controls the maximum amount of time that a hot ranges request
will spend waiting for a node to provide a response.
Set to 0 to disable timeouts.
zachlite added a commit to zachlite/cockroach that referenced this issue Jul 31, 2023
Requests for hot ranges are serviced by a cluster-wide fan-out,
where non-trivial work is done on each node to provide a response.
For each store, and for each hot range, we start a transaction with KV to look
up descriptor info.

Previously, there was no upper bound on the time a node could take
to provide a response. This commit introduces a per-node timeout
in the pagination logic, configurable with the new cluster setting
server.hot_ranges_request.node.timeout. A value of 0 disables the timeout.

Error behavior and semantics are preserved. If a particular node times out,
the fan-out continues as before, as if that node had failed to provide a response.

Informs cockroachdb#104269
Resolves cockroachdb#107627
Epic: none
Release note (ops change): Added a new cluster setting named
server.hot_ranges_request.node.timeout, with a default value of 5 minutes.
The setting controls the maximum amount of time that a hot ranges request
will spend waiting for a node to provide a response.
Set to 0 to disable timeouts.
craig bot pushed a commit that referenced this issue Aug 1, 2023
107796: ui, server: add a timeout per node while collecting hot ranges r=zachlite a=zachlite

Requests for hot ranges are serviced by a cluster-wide fan-out, where non-trivial work is done on each node to provide a response. For each store, and for each hot range, we start a transaction with KV to look up descriptor info.

Previously, there was no upper bound on the time a node could take to provide a response. This commit introduces a per-node timeout in the pagination logic, configurable with the new cluster setting server.hot_ranges.node.timeout. A value of 0 disables the timeout.

Error behavior and semantics are preserved. If a particular node times out, the fan-out continues as before, as if that node had failed to provide a response.

Informs #104269
Resolves #107627 
Epic: none
Release note (ops change): Added a new cluster setting named server.hot_ranges.node.timeout, with a default value of 5 minutes. The setting controls the maximum amount of time that a hot ranges request will spend waiting for a node to provide a response. Set to 0 to disable timeouts.

Co-authored-by: zachlite <[email protected]>
zachlite added a commit to zachlite/cockroach that referenced this issue Aug 18, 2023
Requests for hot ranges are serviced by a cluster-wide fan-out,
where non-trivial work is done on each node to provide a response.
For each store, and for each hot range, we start a transaction with KV to look
up descriptor info.

Previously, there was no upper bound on the time a node could take
to provide a response. This commit introduces a per-node timeout
in the pagination logic, configurable with the new cluster setting
server.hot_ranges_request.node.timeout. A value of 0 disables the timeout.

Error behavior and semantics are preserved. If a particular node times out,
the fan-out continues as before, as if that node had failed to provide a response.

Informs cockroachdb#104269
Resolves cockroachdb#107627
Epic: none
Release note (ops change): Added a new cluster setting named
server.hot_ranges_request.node.timeout, with a default value of 5 minutes.
The setting controls the maximum amount of time that a hot ranges request
will spend waiting for a node to provide a response.
Set to 0 to disable timeouts.
maryliag added the A-check-on-console label Nov 3, 2023