kvprober: allow a manual probe of the entire keyspace #61695
Comments
@tbg, how useful do you think this would be to you, KV folks, support, etc.? Also, do you have any concerns re: safety? In a cluster with a really high number of ranges, perhaps the interface as you have written it could temporarily overload the cluster? If useful, another idea (not mutually exclusive with this one) is to turn on …
Sorry for the delay! I think it would be pretty useful, especially as a temporary mitigation for #74503 (if all replicas are lost, you can't really find out about that except the hard way). There's probably some design work to decide how exactly to do it. We could actually build this into kvprober, or make it something standalone that uses the same probe as kvprober, or something that uses ProbeRequest, or something that uses reads, or something that uses follower reads, etc. I still like the above suggestion to have a built-in function. This project could also be framed in terms of KV observability.
Np re: delay obvi!! Everything you say about the design space makes sense to me. I think CC-wise I'd rather get #74407 into 22.1 than this one, and I'm not even sure I / some other SRE will have time to do that!
Just wanted to add,
Ok, good to know, then let's concentrate on that issue for now.
@tbg I will pick this one up, but I was wondering: if we go with your proposed solution, what exactly would we be returning in the details column? It's not entirely clear to me (bear in mind I know nothing about kvprober aside from looking at the implementation).
@Santamaura that is sort of a good question. In my toy output above I put a detailed timing breakdown, but those stats aren't necessarily available on a per-request basis, at least not at the level of detail that we want, so let's not try to include any now. For now, perhaps have a column indicating the end-to-end latency. You could also have a column that contains a verbose trace of the probe; I imagine this will be very interesting when a range is slow. The probe should also have a timeout (or probing will stall when it hits an unavailable range) and use a decent amount of concurrency (probes are not expensive), so perhaps the timeout should be an input (i.e. defaults to something like 1s, but can be overridden to 100ms or 10s). We should also log which replica served the probe. Summary:

- a column with the end-to-end latency of each probe
- a column with a verbose trace of each probe
- a timeout, ideally as an input (default around 1s, overridable to e.g. 100ms or 10s)
- a decent amount of concurrency across ranges
- log which replica served each probe
We need to implement this vtable generator-style, so that we don't pull all of the info into memory (but stream it out to the client as we go). There should be plenty of examples of this at this point (I'd point to one, but I don't know the codebase well enough; I think the inflight trace registry might be an example?)
PS: in an ideal world we'll get something like #72092 and emit it as structured payloads on the trace (similar to this), and then we can provide very accurate timings for each part of the probe. Btw, we should also distinguish between read and write probes and allow both, but if we're only going to implement one "now", let's make it write probes.
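Putting the pieces of this discussion together, a query against such a table might look roughly like the sketch below. This is only an illustration: the probe_ranges name is taken from the commit messages further down, while the timeout/probe-type parameters and column names are assumptions derived from the suggestions above, not a confirmed interface.

```sql
-- Sketch only: probe every range using write probes with a ~1s per-probe
-- timeout, then surface failing or slow ranges first. The function signature
-- and column names below are assumptions based on the discussion above.
SELECT range_id, error, end_to_end_latency_ms, verbose_trace
FROM crdb_internal.probe_ranges(INTERVAL '1s', 'write')
WHERE error != ''
ORDER BY end_to_end_latency_ms DESC
LIMIT 10;
```

An empty result here would suggest the KV layer is reachable end to end, while any rows point at specific ranges worth investigating.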
Previously there was difficulty diagnosing KV-layer health when an incident occurs. This patch introduces the new virtual table crdb_internal.probe_range, which utilizes the kvprober to probe each range and determine whether the range can be reached. Resolves cockroachdb#61695. Release note: None
Previously there was difficulty diagnosing KV-layer health when an incident occurs. This patch introduces a new virtual table crdb_internal.probe_ranges, which utilizes the kvprober to probe each range and determine whether the range can be reached. Resolves cockroachdb#61695. Release note: None
Is your feature request related to a problem? Please describe.
While kvprober probes "random" KV ranges at intervals, during an incident one also wants to conclusively establish the health of the KV layer by proactively probing the entire keyspace as quickly as possible. This would either surface specific ranges that fail the probe (indicating an issue at the KV layer or below) or turn up nothing (indicating a failure above the KV layer, or a failure not caught by the probe).
Describe the solution you'd like
I think it would make sense to run something like a SQL query against a built-in probe function that returns a per-range result with some details (the sketch below illustrates the idea). So the task at hand becomes implementing crdb_internal.probe.
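For illustration, a minimal sketch of what such a query might look like; the crdb_internal.probe name follows the proposal above, while the column names and output shape are assumptions, not the original toy output:

```sql
-- Hypothetical sketch of the proposed built-in probe: one row per range,
-- indicating whether the probe succeeded, plus a details column. The
-- column names here are illustrative assumptions.
SELECT range_id, ok, details FROM crdb_internal.probe();
```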
Describe alternatives you've considered
One could likely come up with many alternative designs.
Additional context
Currently, in practice I believe we (or at least I) ascertain KV health by looking through the logs for messages of the form "have been waiting X for Y", which are emitted on slow replication, slow latching, slow DistSender RPCs, and a few others. With probes becoming suitably powerful, they should be able to catch most of these. These log messages are also hooked up to the slow.* family of gauges, which however don't let you figure out which ranges have the issue.

Jira issue: CRDB-2743