Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tsdb/ui: store-specific metrics don't filter by node (or store) correctly #102967

Closed
abarganier opened this issue May 9, 2023 · 3 comments
Closed
Labels
A-observability-inf C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs O-testcluster Issues found or occurred on a test cluster, i.e. a long-running internal cluster P-1 Issues/test failures with a fix SLA of 1 month T-observability

Comments

@abarganier
Copy link
Contributor

abarganier commented May 9, 2023

Is your feature request related to a problem? Please describe.

Timeseries is stored under KV, like the rest of our data in CRDB.

A timeseries key contains a source field, which indicates information about which piece of infrastructure that metric is relevant to. For example, for nodeID = 789, you'd have a key such as:

//System/tsd/cr.node.sql.new_conns/.../789

The trailing 789 is the source component of the key. In this case, it tells us this metric was sourced from node ID 789.

However, we also use this source field in the key to indicate which store ID the metric originated from. Metrics like cr.store.raft.commandsapplied use the store ID in the source field, in place of the node ID. So, for this metric originating from store ID 456, on node ID 789, the key would look like:

//System/tsd/cr.store.raft.commandsapplied/.../456

This unfortunately creates problems with our timeseries chart UIs in DB Console. Both in our metric dashboards, as well as the custom timeseries chart tool in the advanced debug page, we have a Node ID filter dropdown that allows users to filter metrics to specific node IDs.

In practice, what this does is it sets a filter on the query request to that node ID as the source to look for in the TSDB key.

So, if you can imagine the following setup:

NodeID 1
|--StoreID 8
|--StoreID 9
NodeID 2
|--StoreID 12
|--StoreID 13

For a store-specific metric like cr.store.raft.commandsapplied, when you set the node ID filter in the UI to NodeID 1, it's setting the source filter on the query request to 1. This means that the server is looking for keys that fit the following format:

//System/tsd/cr.store.raft.commandsapplied/.../1

However, given what we know about these store-specific metrics, and our above example, the only keys available for this metric will look like (one for each store ID):

//System/tsd/cr.store.raft.commandsapplied/.../8
//System/tsd/cr.store.raft.commandsapplied/.../9
//System/tsd/cr.store.raft.commandsapplied/.../12
//System/tsd/cr.store.raft.commandsapplied/.../13

This means that the NodeID = 1 filter set in the request will come back empty. Effectively, the node ID filter in DB Console is broken for both metric dashboards and the custom timeseries chart tool for store-specific metrics.

Describe the solution you'd like
In the above example, if we filter the chart to NodeID = 1 for a store-specific, we should get back an aggregate of all the store metrics that exist on that node. To keep with our example, that means you'd want an aggregate of the following two keys:

//System/tsd/cr.store.raft.commandsapplied/.../8
//System/tsd/cr.store.raft.commandsapplied/.../9

Additionally, it might be a good idea to introduce a separate filter for store ID. Store-specific metrics are prefixed with cr.store.*, so we could conditionally show this filter dropdown in the UI depending on whether the metric is store-specific.
Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
This discussion came from a bug reported during an escalation here: https://cockroachlabs.slack.com/archives/C01CNRP6TSN/p1683629621213809

Jira issue: CRDB-27762

@abarganier abarganier added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-observability-inf labels May 9, 2023
@tbg tbg added the O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs label May 10, 2023
@dhartunian dhartunian added P-2 Issues/test failures with a fix SLA of 3 months P-3 Issues/test failures with no fix SLA and removed P-2 Issues/test failures with a fix SLA of 3 months labels Jan 16, 2024
@ajstorm ajstorm added O-testcluster Issues found or occurred on a test cluster, i.e. a long-running internal cluster P-1 Issues/test failures with a fix SLA of 1 month and removed P-3 Issues/test failures with no fix SLA labels Apr 19, 2024
@ajstorm
Copy link
Collaborator

ajstorm commented Apr 19, 2024

Raising the priority of this issue to P-1 as it's hindering our ability to debug issues on the drt-large cluster.

@abarganier
Copy link
Contributor Author

I believe #121364 was a duplicate of this, which was fixed as of #122151

@dhartunian
Copy link
Collaborator

This is fixed via #122151

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-observability-inf C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs O-testcluster Issues found or occurred on a test cluster, i.e. a long-running internal cluster P-1 Issues/test failures with a fix SLA of 1 month T-observability
Projects
None yet
Development

No branches or pull requests

4 participants