tsdb/ui: store-specific metrics don't filter by node (or store) correctly #102967

abarganier · 2023-05-09T17:53:45Z

Is your feature request related to a problem? Please describe.

Timeseries data is stored in the KV layer, like the rest of our data in CRDB.

A timeseries key contains a source field, which identifies the piece of infrastructure the metric came from. For example, for nodeID = 789, you'd have a key such as:

//System/tsd/cr.node.sql.new_conns/.../789

The trailing 789 is the source component of the key. In this case, it tells us this metric was sourced from node ID 789.

However, we also use this source field in the key to indicate which store ID the metric originated from. Metrics like cr.store.raft.commandsapplied use the store ID in the source field, in place of the node ID. So, for this metric originating from store ID 456, on node ID 789, the key would look like:

//System/tsd/cr.store.raft.commandsapplied/.../456
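
To make the rule above concrete, here is a minimal sketch of which ID ends up in the source field. It is illustrative only and not CockroachDB's actual code: `source_for` and the prefix constants are hypothetical names, and the rest of the key layout (the elided `...`) is deliberately not modeled.

```python
# Hypothetical helper, not CockroachDB code: it models only which ID is
# written into the source field of a timeseries key.
NODE_PREFIX = "cr.node."
STORE_PREFIX = "cr.store."


def source_for(metric_name: str, node_id: int, store_id: int) -> str:
    """Return the source component for a metric recorded on a node/store."""
    if metric_name.startswith(STORE_PREFIX):
        # Store-level metrics carry the store ID; the node ID is not in the key.
        return str(store_id)
    if metric_name.startswith(NODE_PREFIX):
        return str(node_id)
    raise ValueError(f"unrecognized metric prefix: {metric_name}")


print(source_for("cr.node.sql.new_conns", node_id=789, store_id=456))
print(source_for("cr.store.raft.commandsapplied", node_id=789, store_id=456))
```

With the IDs from the examples above, the first call yields the node ID and the second yields the store ID, which is why the node ID never appears in a store metric's key.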

This unfortunately creates problems with our timeseries chart UIs in DB Console. Both our metric dashboards and the custom timeseries chart tool on the advanced debug page have a Node ID filter dropdown that allows users to filter metrics to specific node IDs.

In practice, this sets the selected node ID as the source filter on the query request, so the server looks for that value in the source component of the TSDB key.

So, imagine the following setup:

NodeID 1
|--StoreID 8
|--StoreID 9
NodeID 2
|--StoreID 12
|--StoreID 13

For a store-specific metric like cr.store.raft.commandsapplied, when you set the node ID filter in the UI to NodeID 1, it's setting the source filter on the query request to 1. This means that the server is looking for keys that fit the following format:

//System/tsd/cr.store.raft.commandsapplied/.../1

However, given what we know about these store-specific metrics, and our above example, the only keys available for this metric will look like (one for each store ID):

//System/tsd/cr.store.raft.commandsapplied/.../8
//System/tsd/cr.store.raft.commandsapplied/.../9
//System/tsd/cr.store.raft.commandsapplied/.../12
//System/tsd/cr.store.raft.commandsapplied/.../13

This means that the NodeID = 1 filter set in the request will come back empty. Effectively, the node ID filter in DB Console is broken for both metric dashboards and the custom timeseries chart tool for store-specific metrics.
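
The mismatch can be reproduced with a toy model. This is a sketch under stated assumptions, not the real TSDB query path: the dictionary stands in for the stored keys, `query` is a hypothetical exact-match lookup on the source field, and the datapoints are made up.

```python
# Toy stand-in for the TSDB: (metric name, source) -> datapoints.
# The datapoint values are invented for illustration.
METRIC = "cr.store.raft.commandsapplied"
tsdb = {
    (METRIC, "8"): [10, 12],
    (METRIC, "9"): [7, 9],
    (METRIC, "12"): [3, 4],
    (METRIC, "13"): [5, 6],
}


def query(name, sources):
    """Return the series whose source exactly matches one of `sources`."""
    return {src: tsdb[(name, src)] for src in sources if (name, src) in tsdb}


# Setting the Node ID dropdown to NodeID 1 becomes sources=["1"].
print(query(METRIC, ["1"]))  # prints {} because no key has source "1"
```

Node 1's data does exist under sources 8 and 9, but an exact match on "1" can never reach it.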

Describe the solution you'd like
In the above example, if we filter the chart to NodeID = 1 for a store-specific metric, we should get back an aggregate of all the store metrics that exist on that node. To keep with our example, that means you'd want an aggregate of the following two keys:

//System/tsd/cr.store.raft.commandsapplied/.../8
//System/tsd/cr.store.raft.commandsapplied/.../9
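
One way to picture the requested behavior is to expand the node ID into its store IDs before querying, then combine the results. The sketch below is an assumption-laden illustration, not the fix that was eventually merged: `STORES_BY_NODE`, `sources_for_node`, and `aggregate` are hypothetical names, and summing is only one possible aggregation (the request above does not specify which one).

```python
# Hypothetical node -> store mapping taken from the example topology.
STORES_BY_NODE = {1: [8, 9], 2: [12, 13]}


def sources_for_node(metric_name, node_id):
    """Translate a node ID filter into the sources to query."""
    if metric_name.startswith("cr.store."):
        # Store-specific metric: query every store that lives on the node.
        return [str(s) for s in STORES_BY_NODE.get(node_id, [])]
    return [str(node_id)]


def aggregate(series):
    """Sum series point by point (assumes the series share timestamps)."""
    return [sum(points) for points in zip(*series)]


print(sources_for_node("cr.store.raft.commandsapplied", 1))  # ['8', '9']
print(aggregate([[10, 12], [7, 9]]))  # [17, 21]
```

The mapping from nodes to stores would have to come from cluster metadata; where that lookup lives (UI or server) is left open here.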

Additionally, it might be a good idea to introduce a separate filter for store ID. Store-specific metrics are prefixed with cr.store.*, so we could conditionally show this filter dropdown in the UI depending on whether the metric is store-specific.
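
The conditional display could be as simple as a prefix check. This is a hypothetical sketch: `show_store_filter` is an invented name, and showing the dropdown when any metric on the chart is store-specific (rather than all of them) is an assumption.

```python
STORE_PREFIX = "cr.store."


def show_store_filter(metric_names):
    """Show the store ID dropdown if the chart has any store-specific metric."""
    return any(name.startswith(STORE_PREFIX) for name in metric_names)


print(show_store_filter(["cr.node.sql.new_conns"]))  # False
print(show_store_filter(["cr.node.sql.new_conns", "cr.store.raft.commandsapplied"]))  # True
```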

Additional context
This discussion came from a bug reported during an escalation here: https://cockroachlabs.slack.com/archives/C01CNRP6TSN/p1683629621213809

Jira issue: CRDB-27762


ajstorm · 2024-04-19T20:44:59Z

Raising the priority of this issue to P-1 as it's hindering our ability to debug issues on the drt-large cluster.

abarganier · 2024-04-24T15:23:16Z

I believe #121364 was a duplicate of this, which was fixed as of #122151

dhartunian · 2024-05-22T16:06:26Z

This is fixed via #122151

abarganier added the C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are exception) and T-observability-inf labels May 9, 2023

blathers-crl bot added the A-observability-inf label May 9, 2023

knz mentioned this issue May 9, 2023

kvstorage: number stores separately from node IDs #102968

Draft

tbg added the O-support (Would prevent or help troubleshoot a customer escalation: bugs, missing observability/tooling, docs) label May 10, 2023

dhartunian added the P-2 (Issues/test failures with a fix SLA of 3 months) and P-3 (Issues/test failures with no fix SLA) labels and removed the P-2 label Jan 16, 2024

exalate-issue-sync bot added T-observability and removed T-observability-inf labels Mar 21, 2024

ajstorm added the O-testcluster (Issues found or occurred on a test cluster, i.e. a long-running internal cluster) and P-1 (Issues/test failures with a fix SLA of 1 month) labels and removed the P-3 label Apr 19, 2024

dhartunian closed this as completed May 22, 2024
