sql: `SHOW RANGES` does not work with offline node #106097

erikgrinaker · 2023-07-04T08:44:45Z

This works in 23.1 but not on master. To reproduce:

roachprod create local -n 3
roachprod start local
cockroach workload init kv --splits 1000
roachprod sql local:1

> show ranges from database kv with details;

roachprod stop local:3

> show ranges from database kv with details;
ERROR: failed to dial into node 3 (NODE_STATUS_UNAVAILABLE): initial connection heartbeat failed: grpc: connection error: desc = "transport: error while dialing: dial tcp 10.166.0.3:26261: connect: connection refused" [code 14/Unavailable]

Jira issue: CRDB-29397
Epic: CRDB-30635

The text was updated successfully, but these errors were encountered:

blathers-crl · 2023-07-04T08:44:48Z

Hi @erikgrinaker, please add branch-* labels to identify which branch(es) this release-blocker affects.

_{🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.}

erikgrinaker · 2023-07-04T08:53:01Z

This is fallout from #103128, reverting it fixes the problem. It does work when I omit WITH DETAILS, and I get the same error with tenant_span_stats():

> select * from crdb_internal.tenant_span_stats(1);                                                                                                                                           
ERROR: failed to dial into node 3 (NODE_STATUS_DEAD): initial connection heartbeat failed: grpc: connection error: desc = "transport: error while dialing: dial tcp 10.166.0.3:26261: connect: connection refused" [code 14/Unavailable]

I suppose the root problem here is in tenant_span_stats() rather than SHOW RANGES.

erikgrinaker · 2023-07-04T08:53:32Z

We should also have a test for this. SHOW RANGES needs to work when nodes are offline.

zachlite · 2023-07-05T16:38:19Z

@erikgrinaker thanks for filing. SHOW RANGES... will work fine on 23.1, because the work integrating span stats with show ranges (#103128) was not backported.

knz · 2023-07-19T09:16:26Z

I suppose the root problem here is in tenant_span_stats() rather than SHOW RANGES.

I feel like an acceptable behavior here is to configure a timeout in each node request during the fanout, and report "unknown stats" when a node cannot be reached.

erikgrinaker · 2023-07-19T09:19:48Z

That seems brittle? Why do we have to reach all cluster nodes, as long as we have valid leases for all ranges somewhere?

knz · 2023-07-19T10:02:13Z

Why do we have to reach all cluster nodes, as long as we have valid leases for all ranges somewhere?

It's a good question. There's room for optimization here.

Current algorithm (pseudo-code)

list the nodes that have at least one replica that overlaps with the target span(s) (nodeIDsForSpan)
on each node:
1. call getLocalStats with the list of spans known to have a replica on this node
2. for each span:
  - for those ranges that are fully contained in a span, use a KV batch containing a collection of RangeStatsRequest via distsender (which is expected to hit locally) - flushBatchedContainedKeys
  - for those ranges that overlap partially (first and last range), call storage.ComputeStats which is a local call.

Areas for improvement

In decreasing order of impact:

all this fan-out complexity is because storage.ComputeStats is a local call. If there was a KV request type that could handle ComputeStats, DistSender could own the entire networking here and the entire algorithm could collapse into a very small function.
it seems strange we hit all nodes with at least one replica. Should we not restrict the fanout to the leaseholders only?
we only use the distsender fast path inside the inner loop. The case of fully-included ranges should be hoisted at the top of the outer loop instead. DistSender can own the reaching out to leaseholder over the network.
the KV batch constructed by flushBatchedContainedKeys may grow to become arbitrary large. This should be cut up in smaller pieces (maybe 100 ranges at a time).

knz · 2023-07-19T10:05:51Z

For the very first point, Zach would probably need some support from the KV team.
In the past, Zach was working within the confine of existing code patterns for observability functions. Doing this optimization is really a KV API extension, which is a somewhat different design direction.

@erikgrinaker who in the team would be best to brainstorm here?

erikgrinaker · 2023-07-19T10:24:19Z

@shralex can assign this.

zachlite · 2023-07-19T14:08:31Z

Is there value in a short term solution where a request to an offline node just returns an empty SpanStatsResponse, instead of invalidating the entire fan-out?

knz · 2023-07-19T14:09:14Z

Is there value in a short term solution where a request to an offline node just returns an empty SpanStatsResponse, instead of invalidating the entire fan-out?

Yes, that would help a little bit.

knz · 2023-07-19T14:10:20Z

Actually, regardless of the architecture you choose (current or with optimizations discussed above) there's still a path through KV that can time out. I believe in both case it is useful to have a customizable time out to retrieve partial data.

This commit adds improved fault tolerance to the span stats fan-out: 1. Errors encountered during the fan-out will not invalidate the entire request. Now, a span stats fan-out will always return a roachpb.SpanStatsResponse that has been updated by values from nodes that service their requests without error. Errors that are encountered are logged. Errors may be due to a connection error, or due to a KV-related error while a node is servicing a request. 2. Nodes must service requests within the timeout passed to `iterateNodes`. For span stats, the value comes from a new cluster setting: 'server.span_stats.node.timeout', with a default value of 1 minute. Resolves cockroachdb#106097 Epic: none Release note (ops change): Added a new cluster setting, 'server.span_stats.node.timeout' to control the maximum duration that a node is allowed to spend servicing a span stats request. A value of '0' will not timeout.

This commit adds improved fault tolerance to the span stats fan-out: 1. Errors encountered during the fan-out will not invalidate the entire request. Now, a span stats fan-out will always return a roachpb.SpanStatsResponse that has been updated by values from nodes that service their requests without error. In the extreme case where there's a failure encountered on every node, an empty response is returned. Errors that are encountered are logged, and then appended to the response in the newly added `Errors` field. 2. Nodes must service requests within the timeout passed to `iterateNodes`. For span stats, the value comes from a new cluster setting: 'server.span_stats.node.timeout', with a default value of 1 minute. Resolves cockroachdb#106097 Epic: none Release note (ops change): Span stats requests will return a partial result if the request encounters any errors. Errors that would have previously terminated the request are now included in the response.

108456: server: make the span stats fan-out more fault tolerant r=zachlite a=zachlite This commit adds improved fault tolerance to the span stats fan-out: 1. Errors encountered during the fan-out will not invalidate the entire request. Now, a span stats fan-out will always return a roachpb.SpanStatsResponse that has been updated by values from nodes that service their requests without error. In the extreme case where there's a failure encountered on every node, an empty response is returned. Errors that are encountered are logged, and then appended to the response in the newly added `Errors` field. 2. Nodes must service requests within the timeout passed to `iterateNodes`. For span stats, the value comes from a new cluster setting: 'server.span_stats.node.timeout', with a default value of 1 minute. Resolves #106097 Epic: none Release note (ops change): Span stats requests will return a partial result if the request encounters any errors. Errors that would have previously terminated the request are now included in the response. Co-authored-by: zachlite <[email protected]>

This commit adds improved fault tolerance to the span stats fan-out: 1. Errors encountered during the fan-out will not invalidate the entire request. Now, a span stats fan-out will always return a roachpb.SpanStatsResponse that has been updated by values from nodes that service their requests without error. In the extreme case where there's a failure encountered on every node, an empty response is returned. Errors that are encountered are logged, and then appended to the response in the newly added `Errors` field. 2. Nodes must service requests within the timeout passed to `iterateNodes`. For span stats, the value comes from a new cluster setting: 'server.span_stats.node.timeout', with a default value of 1 minute. Resolves cockroachdb#106097 Epic: none Release note (ops change): Span stats requests will return a partial result if the request encounters any errors. Errors that would have previously terminated the request are now included in the response.

erikgrinaker added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-cluster-observability labels Jul 4, 2023

erikgrinaker added the branch-master Failures and bugs on the master branch. label Jul 4, 2023

erikgrinaker assigned zachlite Jul 4, 2023

erikgrinaker changed the title ~~server: SHOW RANGES does not work with offline node~~ sql: SHOW RANGES does not work with offline node Jul 4, 2023

erikgrinaker mentioned this issue Jul 5, 2023

kvserver: transfer lease when acquiring lease outside preferences #106079

Closed

erikgrinaker mentioned this issue Jul 19, 2023

release-23.1: sql: extend SHOW RANGES to include object size estimates #105306

Closed

zachlite mentioned this issue Aug 9, 2023

server: make the span stats fan-out more fault tolerant #108456

Merged

craig bot closed this as completed in 5ddcdd1 Aug 15, 2023

knz mentioned this issue Aug 15, 2023

sql: "span stats" algorithm is sub-optimal in a way that causes correctness issues #108779

Open

zachlite mentioned this issue Aug 18, 2023

release-23.1: server: make the span stats fan-out more fault tolerant #109008

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sql: `SHOW RANGES` does not work with offline node #106097

sql: `SHOW RANGES` does not work with offline node #106097

erikgrinaker commented Jul 4, 2023 •

edited by knz

Loading

blathers-crl bot commented Jul 4, 2023

erikgrinaker commented Jul 4, 2023 •

edited

Loading

erikgrinaker commented Jul 4, 2023

zachlite commented Jul 5, 2023

knz commented Jul 19, 2023

erikgrinaker commented Jul 19, 2023 •

edited

Loading

knz commented Jul 19, 2023

knz commented Jul 19, 2023

erikgrinaker commented Jul 19, 2023

zachlite commented Jul 19, 2023

knz commented Jul 19, 2023

knz commented Jul 19, 2023

sql: SHOW RANGES does not work with offline node #106097

sql: SHOW RANGES does not work with offline node #106097

Comments

erikgrinaker commented Jul 4, 2023 • edited by knz Loading

blathers-crl bot commented Jul 4, 2023

erikgrinaker commented Jul 4, 2023 • edited Loading

erikgrinaker commented Jul 4, 2023

zachlite commented Jul 5, 2023

knz commented Jul 19, 2023

erikgrinaker commented Jul 19, 2023 • edited Loading

knz commented Jul 19, 2023

Current algorithm (pseudo-code)

Areas for improvement

knz commented Jul 19, 2023

erikgrinaker commented Jul 19, 2023

zachlite commented Jul 19, 2023

knz commented Jul 19, 2023

knz commented Jul 19, 2023

sql: `SHOW RANGES` does not work with offline node #106097

sql: `SHOW RANGES` does not work with offline node #106097

erikgrinaker commented Jul 4, 2023 •

edited by knz

Loading

erikgrinaker commented Jul 4, 2023 •

edited

Loading

erikgrinaker commented Jul 19, 2023 •

edited

Loading