
kvprober: allow a manual probe of the entire keyspace #61695

Closed
tbg opened this issue Mar 9, 2021 · 7 comments · Fixed by #79546
Assignees
Labels
A-kv Anything in KV that doesn't belong in a more specific category. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-sre For issues SRE opened or otherwise cares about tracking.

Comments

tbg commented Mar 9, 2021

Is your feature request related to a problem? Please describe.
While the kvprober probes "random" KV ranges at intervals, during an incident one also wants to conclusively establish the health of the KV layer by proactively probing the entire keyspace as quickly as possible. This would either turn up specific ranges that fail the probe (indicating an issue at the KV layer or below), or none (indicating a failure above the KV layer, or a failure not caught by the probe).

Describe the solution you'd like

I think it would make sense to run something like

select crdb_internal.probe_range(start_key) from crdb_internal.ranges_no_leases;
  range_id | pass | details
-----------+------+----------
         1 | true | {"read_ns": 150031, "write_ns": ...}
[...]

So the task at hand becomes implementing crdb_internal.probe_range.

Describe alternatives you've considered

One could likely come up with many alternative designs.

Additional context

Currently, in practice I believe we (or at least I) ascertain KV health by looking through the logs of messages of the form have been waiting X for Y which are emitted on slow replication, slow latching, slow DistSender RPCs, and a few others. With probes becoming suitably powerful, they should be able to catch most of these. These log messages are also hooked up to the slow.* family of gauges, which however don't let you figure out which ranges have the issue.

Jira issue: CRDB-2743

@tbg tbg added A-kv Anything in KV that doesn't belong in a more specific category. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) labels Mar 9, 2021
@jlinder jlinder added the T-kv KV Team label Jun 16, 2021
joshimhoff (Collaborator) commented:

@tbg, how useful do you think this would be to you, KV folks, support, etc.?

Also, do you have any concerns re: safety? In a cluster with a really high number of ranges, perhaps the interface as you have written it could temporarily overload the cluster?

If useful, another idea (not mutually exclusive with this one) is to turn on kvprober by default in 22.1. I like that idea for sure.

@joshimhoff joshimhoff added the O-sre For issues SRE opened or otherwise cares about tracking. label Jan 4, 2022
tbg commented Jan 17, 2022

Sorry for the delay! I think it would be pretty useful, especially as a temporary mitigation for #74503 (if all replicas are lost, you can't really find out about that except the hard way). There's probably some design work to decide how exactly to do it. We could build this into kvprober itself, or make it something standalone that uses the same probe as kvprober, or something that uses ProbeRequest, or something that uses reads, or something that uses follower reads, etc.

I still like the above suggestion to have a built-in function crdb_internal.probe_range, perhaps that could take an argument indicating the type of probe.

This project could also be framed in terms of KV observability.

joshimhoff commented Jan 18, 2022

Np re: delay obvi!! Everything you say about the design space makes sense to me.

I think CC-wise I'd rather get #74407 into 22.1 than this one, and I'm not even sure I / some other SRE will have time to do that!

tbg commented Jan 20, 2022

Just wanted to add: crdb_internal.ranges is sort of a straw-man "probe all ranges". This is because it queries the leaseholder for each range, meaning it at least needs to be able to route to the leaseholder for each range and reach a replica that responds with the lease. It doesn't have good behavior on failure (it just hangs forever), but it's interesting that this exists; it will definitely detect when all replicas of a range are missing.

I think CC-wise I'd rather get #74407 into 22.1 than this one, and I'm not even sure I / some other SRE will have time to do that!

Ok, good to know, then let's concentrate on that issue for now.

@exalate-issue-sync exalate-issue-sync bot removed the T-kv KV Team label Feb 7, 2022
Santamaura commented Mar 3, 2022

@tbg I will pick this one up, but I was wondering: if we go with your proposed solution, what exactly would we be returning in the details column? It's not entirely clear to me (bear in mind I know nothing about kvprober aside from looking at the implementation).

tbg commented Mar 4, 2022

@Santamaura that is a good question. In my toy output above I put a detailed timing breakdown, but these stats aren't necessarily available on a per-request basis, at least not at the level of detail that we want, so let's not try to include any now. For now, perhaps have a column indicating the end-to-end latency. You could also have a column that contains a verbose trace of the probe; I imagine this will be very interesting when a range is slow. The probe should also have a timeout (or probing will stall when it hits an unavailable range) and use a decent amount of concurrency (probes are not expensive), so perhaps the timeout should be an input (i.e. defaults to something like 1s, but can be overridden to 100ms or 10s). We should also log which replica served the probe.

Summary:

  • rangeID
  • replica that served the probe (maybe this is hard to get, but I think there is a ReturnRangeInfo flag on the batch header)
  • error (NULL if none, meaning probe succeeded; if timed out this should be a descriptive error communicating that, not just "context canceled")
  • end-to-end latency
  • verbose trace

We need to implement this vtable generator-style, so that we don't pull all of the info into memory (but stream it out to the client as we go). There should be plenty of examples of this at this point (I'd point to one, but don't know the codebase well enough - I think the inflight trace registry might be an example?)
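The generator style described above can be sketched in Go. All names here are hypothetical, not the actual CockroachDB vtable machinery: the point is only that rows are handed to an emit callback one at a time instead of being accumulated in a slice, so memory stays bounded no matter how many ranges the keyspace has.

```go
package main

import "fmt"

// probeResult mirrors the proposed columns: range ID, error (nil on
// success), and end-to-end latency in nanoseconds. Hypothetical type.
type probeResult struct {
	rangeID int64
	err     error
	nanos   int64
}

// probeRanges illustrates the generator style: rather than collecting
// every result into a slice (which would pull the whole keyspace's
// worth of rows into memory), it invokes emit for each row as it is
// produced, so results stream out to the client incrementally.
func probeRanges(rangeIDs []int64, probe func(int64) probeResult, emit func(probeResult) error) error {
	for _, id := range rangeIDs {
		if err := emit(probe(id)); err != nil {
			return err // client went away; stop probing
		}
	}
	return nil
}

func main() {
	// A fake probe standing in for the real KV round-trip.
	fakeProbe := func(id int64) probeResult {
		return probeResult{rangeID: id, nanos: 150031}
	}
	var rows int
	_ = probeRanges([]int64{1, 2, 3}, fakeProbe, func(r probeResult) error {
		rows++
		fmt.Println(r.rangeID, r.nanos)
		return nil
	})
	fmt.Println("rows emitted:", rows)
}
```

The same shape also makes the timeout/concurrency knobs easy to add later: the probe closure can wrap its work in a context with deadline, and a bounded worker pool can call emit under a mutex.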

tbg commented Mar 4, 2022

PS in an ideal world we'll get something like #72092 and emit it as structured payloads on the trace (similar to this) and then we can provide very accurate timings for each part of the probe.

Btw, we should also distinguish between read and write probes & allow both, but if we're only going to implement one "now", let's make it write probes.
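For illustration, here is roughly how read and write probes could both be exposed through such a builtin. The signature and column names below are assumptions modeled on the crdb_internal.probe_ranges table that the linked PR #79546 eventually merged; verify them against the actual release before relying on them:

```sql
-- Write-probe every range with a 1s per-probe timeout,
-- surfacing only failures, slowest first.
SELECT range_id, error, end_to_end_latency_ms
FROM crdb_internal.probe_ranges(INTERVAL '1s', 'write')
WHERE error != ''
ORDER BY end_to_end_latency_ms DESC;

-- The read variant exercises the same routing path
-- without performing any writes.
SELECT range_id, error, end_to_end_latency_ms
FROM crdb_internal.probe_ranges(INTERVAL '1s', 'read');
```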

Santamaura added a commit to Santamaura/cockroach that referenced this issue Apr 7, 2022
Previously there was difficulty in diagnosing kv layer health when
an incident occurs. This patch introduces the new virtual table
crdb_internal.probe_range which utilizes the kvprober to probe
each range to determine if the range can be reached or not.

resolves cockroachdb#61695

Release note: None
Santamaura added a commit to Santamaura/cockroach that referenced this issue Apr 13, 2022 (same commit message as above)
Santamaura added a commit to Santamaura/cockroach that referenced this issue Apr 20, 2022
Previously there was difficulty diagnosing kv layer health
when an incident occurs. This patch introduces a new virtual
table crdb_internal.probe_ranges which utilizes the kvprober
to probe each range to determine if the range can be reached
or not.

resolves cockroachdb#61695

Release note: None
Santamaura added a commit to Santamaura/cockroach that referenced this issue Apr 21, 2022 (same commit message)
Santamaura added a commit to Santamaura/cockroach that referenced this issue Apr 22, 2022 (same commit message)
@craig craig bot closed this as completed in 6bbb282 Apr 25, 2022
Santamaura added a commit to Santamaura/cockroach that referenced this issue Jul 5, 2022 (same commit message)
Santamaura added a commit to Santamaura/cockroach that referenced this issue Sep 2, 2022 (same commit message)