Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: SHOW RANGES does not work with offline node #106097

Closed
erikgrinaker opened this issue Jul 4, 2023 · 12 comments · Fixed by #108456
Closed

sql: SHOW RANGES does not work with offline node #106097

erikgrinaker opened this issue Jul 4, 2023 · 12 comments · Fixed by #108456
Assignees
Labels
branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.

Comments

@erikgrinaker
Copy link
Contributor

erikgrinaker commented Jul 4, 2023

This works in 23.1 but not on master. To reproduce:

roachprod create local -n 3
roachprod start local
cockroach workload init kv --splits 1000
roachprod sql local:1

> show ranges from database kv with details;

roachprod stop local:3

> show ranges from database kv with details;
ERROR: failed to dial into node 3 (NODE_STATUS_UNAVAILABLE): initial connection heartbeat failed: grpc: connection error: desc = "transport: error while dialing: dial tcp 10.166.0.3:26261: connect: connection refused" [code 14/Unavailable]

Jira issue: CRDB-29397
Epic: CRDB-30635

@erikgrinaker erikgrinaker added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-cluster-observability labels Jul 4, 2023
@blathers-crl
Copy link

blathers-crl bot commented Jul 4, 2023

Hi @erikgrinaker, please add branch-* labels to identify which branch(es) this release-blocker affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@erikgrinaker erikgrinaker added the branch-master Failures and bugs on the master branch. label Jul 4, 2023
@erikgrinaker
Copy link
Contributor Author

erikgrinaker commented Jul 4, 2023

This is fallout from #103128, reverting it fixes the problem. It does work when I omit WITH DETAILS, and I get the same error with tenant_span_stats():

> select * from crdb_internal.tenant_span_stats(1);                                                                                                                                           
ERROR: failed to dial into node 3 (NODE_STATUS_DEAD): initial connection heartbeat failed: grpc: connection error: desc = "transport: error while dialing: dial tcp 10.166.0.3:26261: connect: connection refused" [code 14/Unavailable]

I suppose the root problem here is in tenant_span_stats() rather than SHOW RANGES.

@erikgrinaker
Copy link
Contributor Author

We should also have a test for this. SHOW RANGES needs to work when nodes are offline.

@erikgrinaker erikgrinaker changed the title server: SHOW RANGES does not work with offline node sql: SHOW RANGES does not work with offline node Jul 4, 2023
@zachlite
Copy link
Contributor

zachlite commented Jul 5, 2023

@erikgrinaker thanks for filing. SHOW RANGES... will work fine on 23.1, because the work integrating span stats with show ranges (#103128) was not backported.

@knz
Copy link
Contributor

knz commented Jul 19, 2023

I suppose the root problem here is in tenant_span_stats() rather than SHOW RANGES.

I feel like an acceptable behavior here is to configure a timeout in each node request during the fanout, and report "unknown stats" when a node cannot be reached.

@erikgrinaker
Copy link
Contributor Author

erikgrinaker commented Jul 19, 2023

That seems brittle? Why do we have to reach all cluster nodes, as long as we have valid leases for all ranges somewhere?

@knz
Copy link
Contributor

knz commented Jul 19, 2023

Why do we have to reach all cluster nodes, as long as we have valid leases for all ranges somewhere?

It's a good question. There's room for optimization here.

Current algorithm (pseudo-code)

  1. list the nodes that have at least one replica that overlaps with the target span(s) (nodeIDsForSpan)
  2. on each node:
    1. call getLocalStats with the list of spans known to have a replica on this node
    2. for each span:
      • for those ranges that are fully contained in a span, use a KV batch containing a collection of RangeStatsRequest via distsender (which is expected to hit locally) - flushBatchedContainedKeys
      • for those ranges that overlap partially (first and last range), call storage.ComputeStats which is a local call.

Areas for improvement

In decreasing order of impact:

  • all this fan-out complexity is because storage.ComputeStats is a local call. If there was a KV request type that could handle ComputeStats, DistSender could own the entire networking here and the entire algorithm could collapse into a very small function.
  • it seems strange we hit all nodes with at least one replica. Should we not restrict the fanout to the leaseholders only?
  • we only use the distsender fast path inside the inner loop. The case of fully-included ranges should be hoisted at the top of the outer loop instead. DistSender can own the reaching out to leaseholder over the network.
  • the KV batch constructed by flushBatchedContainedKeys may grow to become arbitrary large. This should be cut up in smaller pieces (maybe 100 ranges at a time).

@knz
Copy link
Contributor

knz commented Jul 19, 2023

For the very first point, Zach would probably need some support from the KV team.
In the past, Zach was working within the confine of existing code patterns for observability functions. Doing this optimization is really a KV API extension, which is a somewhat different design direction.

@erikgrinaker who in the team would be best to brainstorm here?

@erikgrinaker
Copy link
Contributor Author

@shralex can assign this.

@zachlite
Copy link
Contributor

Is there value in a short term solution where a request to an offline node just returns an empty SpanStatsResponse, instead of invalidating the entire fan-out?

@knz
Copy link
Contributor

knz commented Jul 19, 2023

Is there value in a short term solution where a request to an offline node just returns an empty SpanStatsResponse, instead of invalidating the entire fan-out?

Yes, that would help a little bit.

@knz
Copy link
Contributor

knz commented Jul 19, 2023

Actually, regardless of the architecture you choose (current or with optimizations discussed above) there's still a path through KV that can time out. I believe in both case it is useful to have a customizable time out to retrieve partial data.

zachlite added a commit to zachlite/cockroach that referenced this issue Aug 9, 2023
This commit adds improved fault tolerance to the span stats fan-out:

1. Errors encountered during the fan-out will not invalidate the
entire request. Now, a span stats fan-out will always return a
roachpb.SpanStatsResponse that has been updated by values from nodes that
service their requests without error.

Errors that are encountered are logged. Errors may be due to a connection
error, or due to a KV-related error while a node is servicing a request.

2. Nodes must service requests within the timeout passed to `iterateNodes`.
For span stats, the value comes from a new cluster setting:
'server.span_stats.node.timeout', with a default value of 1 minute.

Resolves cockroachdb#106097
Epic: none
Release note (ops change): Added a new cluster setting,
'server.span_stats.node.timeout' to control the maximum duration that
a node is allowed to spend servicing a span stats request. A value of
'0' will not timeout.
zachlite added a commit to zachlite/cockroach that referenced this issue Aug 10, 2023
This commit adds improved fault tolerance to the span stats fan-out:

1. Errors encountered during the fan-out will not invalidate the
entire request. Now, a span stats fan-out will always return a
roachpb.SpanStatsResponse that has been updated by values from nodes that
service their requests without error. In the extreme case where there's
a failure encountered on every node, an empty response is returned.

Errors that are encountered are logged, and then appended to the response
in the newly added `Errors` field.

2. Nodes must service requests within the timeout passed to `iterateNodes`.
For span stats, the value comes from a new cluster setting:
'server.span_stats.node.timeout', with a default value of 1 minute.

Resolves cockroachdb#106097
Epic: none
Release note (ops change): Span stats requests will return a partial
result if the request encounters any errors. Errors that would have
previously terminated the request are now included in the response.
zachlite added a commit to zachlite/cockroach that referenced this issue Aug 11, 2023
This commit adds improved fault tolerance to the span stats fan-out:

1. Errors encountered during the fan-out will not invalidate the
entire request. Now, a span stats fan-out will always return a
roachpb.SpanStatsResponse that has been updated by values from nodes that
service their requests without error. In the extreme case where there's
a failure encountered on every node, an empty response is returned.

Errors that are encountered are logged, and then appended to the response
in the newly added `Errors` field.

2. Nodes must service requests within the timeout passed to `iterateNodes`.
For span stats, the value comes from a new cluster setting:
'server.span_stats.node.timeout', with a default value of 1 minute.

Resolves cockroachdb#106097
Epic: none
Release note (ops change): Span stats requests will return a partial
result if the request encounters any errors. Errors that would have
previously terminated the request are now included in the response.
zachlite added a commit to zachlite/cockroach that referenced this issue Aug 14, 2023
This commit adds improved fault tolerance to the span stats fan-out:

1. Errors encountered during the fan-out will not invalidate the
entire request. Now, a span stats fan-out will always return a
roachpb.SpanStatsResponse that has been updated by values from nodes that
service their requests without error. In the extreme case where there's
a failure encountered on every node, an empty response is returned.

Errors that are encountered are logged, and then appended to the response
in the newly added `Errors` field.

2. Nodes must service requests within the timeout passed to `iterateNodes`.
For span stats, the value comes from a new cluster setting:
'server.span_stats.node.timeout', with a default value of 1 minute.

Resolves cockroachdb#106097
Epic: none
Release note (ops change): Span stats requests will return a partial
result if the request encounters any errors. Errors that would have
previously terminated the request are now included in the response.
zachlite added a commit to zachlite/cockroach that referenced this issue Aug 14, 2023
This commit adds improved fault tolerance to the span stats fan-out:

1. Errors encountered during the fan-out will not invalidate the
entire request. Now, a span stats fan-out will always return a
roachpb.SpanStatsResponse that has been updated by values from nodes that
service their requests without error. In the extreme case where there's
a failure encountered on every node, an empty response is returned.

Errors that are encountered are logged, and then appended to the response
in the newly added `Errors` field.

2. Nodes must service requests within the timeout passed to `iterateNodes`.
For span stats, the value comes from a new cluster setting:
'server.span_stats.node.timeout', with a default value of 1 minute.

Resolves cockroachdb#106097
Epic: none
Release note (ops change): Span stats requests will return a partial
result if the request encounters any errors. Errors that would have
previously terminated the request are now included in the response.
craig bot pushed a commit that referenced this issue Aug 15, 2023
108456: server: make the span stats fan-out more fault tolerant r=zachlite a=zachlite

This commit adds improved fault tolerance to the span stats fan-out:

1. Errors encountered during the fan-out will not invalidate the
entire request. Now, a span stats fan-out will always return a
roachpb.SpanStatsResponse that has been updated by values from nodes that
service their requests without error. In the extreme case where there's
a failure encountered on every node, an empty response is returned.

    Errors that are encountered are logged, and then appended to the response
in the newly added `Errors` field.

2. Nodes must service requests within the timeout passed to `iterateNodes`.
For span stats, the value comes from a new cluster setting:
'server.span_stats.node.timeout', with a default value of 1 minute.

Resolves #106097
Epic: none
Release note (ops change): Span stats requests will return a partial
result if the request encounters any errors. Errors that would have
previously terminated the request are now included in the response.

Co-authored-by: zachlite <[email protected]>
@craig craig bot closed this as completed in 5ddcdd1 Aug 15, 2023
zachlite added a commit to zachlite/cockroach that referenced this issue Aug 18, 2023
This commit adds improved fault tolerance to the span stats fan-out:

1. Errors encountered during the fan-out will not invalidate the
entire request. Now, a span stats fan-out will always return a
roachpb.SpanStatsResponse that has been updated by values from nodes that
service their requests without error. In the extreme case where there's
a failure encountered on every node, an empty response is returned.

Errors that are encountered are logged, and then appended to the response
in the newly added `Errors` field.

2. Nodes must service requests within the timeout passed to `iterateNodes`.
For span stats, the value comes from a new cluster setting:
'server.span_stats.node.timeout', with a default value of 1 minute.

Resolves cockroachdb#106097
Epic: none
Release note (ops change): Span stats requests will return a partial
result if the request encounters any errors. Errors that would have
previously terminated the request are now included in the response.
zachlite added a commit to zachlite/cockroach that referenced this issue Aug 21, 2023
This commit adds improved fault tolerance to the span stats fan-out:

1. Errors encountered during the fan-out will not invalidate the
entire request. Now, a span stats fan-out will always return a
roachpb.SpanStatsResponse that has been updated by values from nodes that
service their requests without error. In the extreme case where there's
a failure encountered on every node, an empty response is returned.

Errors that are encountered are logged, and then appended to the response
in the newly added `Errors` field.

2. Nodes must service requests within the timeout passed to `iterateNodes`.
For span stats, the value comes from a new cluster setting:
'server.span_stats.node.timeout', with a default value of 1 minute.

Resolves cockroachdb#106097
Epic: none
Release note (ops change): Span stats requests will return a partial
result if the request encounters any errors. Errors that would have
previously terminated the request are now included in the response.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants