sql: SHOW RANGES does not work with offline node #106097
Comments
Hi @erikgrinaker, please add branch-* labels to identify which branch(es) this release-blocker affects. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
This is fallout from #103128, reverting it fixes the problem. It does work when I omit
I suppose the root problem here is in
We should also have a test for this. SHOW RANGES needs to work when nodes are offline.
@erikgrinaker thanks for filing.
I feel like an acceptable behavior here is to configure a timeout in each node request during the fanout, and report "unknown stats" when a node cannot be reached.
That seems brittle? Why do we have to reach all cluster nodes, as long as we have valid leases for all ranges somewhere?
It's a good question. There's room for optimization here.
Current algorithm (pseudo-code)
Areas for improvement
In decreasing order of impact:
For the very first point, Zach would probably need some support from the KV team. @erikgrinaker who in the team would be best to brainstorm here?
@shralex can assign this.
Is there value in a short term solution where a request to an offline node just returns an empty
Yes, that would help a little bit.
Actually, regardless of the architecture you choose (current or with the optimizations discussed above), there's still a path through KV that can time out. I believe in both cases it is useful to have a customizable timeout to retrieve partial data.
This commit adds improved fault tolerance to the span stats fan-out:

1. Errors encountered during the fan-out will not invalidate the entire request. Now, a span stats fan-out will always return a roachpb.SpanStatsResponse that has been updated by values from nodes that service their requests without error. Errors that are encountered are logged. Errors may be due to a connection error, or due to a KV-related error while a node is servicing a request.
2. Nodes must service requests within the timeout passed to `iterateNodes`. For span stats, the value comes from a new cluster setting: 'server.span_stats.node.timeout', with a default value of 1 minute.

Resolves cockroachdb#106097
Epic: none
Release note (ops change): Added a new cluster setting, 'server.span_stats.node.timeout', to control the maximum duration that a node is allowed to spend servicing a span stats request. A value of '0' will not time out.
This commit adds improved fault tolerance to the span stats fan-out:

1. Errors encountered during the fan-out will not invalidate the entire request. Now, a span stats fan-out will always return a roachpb.SpanStatsResponse that has been updated by values from nodes that service their requests without error. In the extreme case where there's a failure encountered on every node, an empty response is returned. Errors that are encountered are logged, and then appended to the response in the newly added `Errors` field.
2. Nodes must service requests within the timeout passed to `iterateNodes`. For span stats, the value comes from a new cluster setting: 'server.span_stats.node.timeout', with a default value of 1 minute.

Resolves cockroachdb#106097
Epic: none
Release note (ops change): Span stats requests will return a partial result if the request encounters any errors. Errors that would have previously terminated the request are now included in the response.
108456: server: make the span stats fan-out more fault tolerant r=zachlite a=zachlite

This commit adds improved fault tolerance to the span stats fan-out:

1. Errors encountered during the fan-out will not invalidate the entire request. Now, a span stats fan-out will always return a roachpb.SpanStatsResponse that has been updated by values from nodes that service their requests without error. In the extreme case where there's a failure encountered on every node, an empty response is returned. Errors that are encountered are logged, and then appended to the response in the newly added `Errors` field.
2. Nodes must service requests within the timeout passed to `iterateNodes`. For span stats, the value comes from a new cluster setting: 'server.span_stats.node.timeout', with a default value of 1 minute.

Resolves #106097
Epic: none
Release note (ops change): Span stats requests will return a partial result if the request encounters any errors. Errors that would have previously terminated the request are now included in the response.

Co-authored-by: zachlite <[email protected]>
This works in 23.1 but not on master. To reproduce:
Jira issue: CRDB-29397
Epic: CRDB-30635