Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

db-console: per-store metrics not displayed after importing multi-store tsdump #121364

Closed
kvoli opened this issue Mar 29, 2024 · 3 comments · Fixed by #122151
Closed

db-console: per-store metrics not displayed after importing multi-store tsdump #121364

kvoli opened this issue Mar 29, 2024 · 3 comments · Fixed by #122151
Assignees
Labels
branch-release-24.1 Used to mark GA and release blockers, technical advisories, and bugs for 24.1 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. GA-blocker O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs P-2 Issues/test failures with a fix SLA of 3 months T-observability

Comments

@kvoli
Copy link
Collaborator

kvoli commented Mar 29, 2024

Describe the problem

Custom metrics page doesn't display per-node metrics (multi-store) after importing TSDump.

To Reproduce

See TSDump from this issue (internal).

Stage the TSDump using the linked debug zip stores yaml and tsdump.gob.

Expected behavior

Per-node metrics are displayed in the custom metrics page.

Environment:

  • CockroachDB version v23.1.13
  • Client app TSDump single node import

Jira issue: CRDB-37214

@kvoli kvoli added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Mar 29, 2024
@kvoli kvoli added the O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs label Apr 1, 2024
@sumeerbhola
Copy link
Collaborator

I would consider this a release blocker -- it seriously hinders our ability to use existing metrics. Thoughts?

@dhartunian dhartunian added GA-blocker branch-release-24.1 Used to mark GA and release blockers, technical advisories, and bugs for 24.1 labels Apr 3, 2024
@abarganier
Copy link
Contributor

abarganier commented Apr 9, 2024

Definitely some unfortunate behavior related to multi-store nodes.

If I run a multi-store cluster and then run a workload:

$ ./cockroach start-single-node --insecure --store=./cockroach-data/store1 --store=./cockroach-data/store2 --store=./cockroach-data/store3
$ ./cockroach workload init tpcc
$ ./cockroach workload run tpcc

If you view a store-level metric such as cr.store.capacity.used, you get something resembling this for the cluster (which, in our case, is also the node, because we only have 1 node):
Screenshot 2024-04-09 at 3 34 12 PM

We see a value exceeding 750MiB. This is because we have 3 stores, and the sum of the usage of each node is what's plotted on the chart. If we take a tsdump we can see the raw metric values for each store. Note that the sum of all 3 matches what's plotted:

cr.store.capacity.used 1
1712690930000000000 3.373435e+08 // 337MiB

cr.store.capacity.used 2
1712690930000000000 2.85115949e+08 // 285MiB

cr.store.capacity.used 3
1712690930000000000 1.83637545e+08 // 183MiB

However, if we tick the "Per Node" box, we see the node-level view, with values around ~340MiB, which we can tell is the value from store1:
Screenshot 2024-04-09 at 3 34 19 PM

This is confusing. First of all, the data here is at the store-level, not the node-level. In the tsdump, the ID attribute associated with the key is either a storeID or a nodeID.

Even so, n1's combined stores are using ~750MiB, but we can only see that in the cluster-level view. As soon as we switch to node-level, suddenly n1 is only reported as using ~340MiB. In reality, what's being reported here is store1.

TL;DR:

  1. Store metrics are being recorded into TSDB at the proper store level, so the underlying data is correct. It's how it's being queried/displayed that's problematic.
  2. If a node has multiple stores, the cluster-level view will SUM all the stores across each node to derive a value. However, if you switch to the node-level view, each node only reports values from its 1st store instead of SUM. This is an inconsistent and misleading experience.

@abarganier
Copy link
Contributor

Potentially related: link (internal)

abarganier added a commit to abarganier/cockroach that referenced this issue Apr 15, 2024
Fixes: cockroachdb#121364

This patch fixes a bug in the DB Console custom chart tool, where
selecting the "Per Node" checkbox on a metric would not properly display
store-level metrics. The previous expected behavior was that the check
box would cause the metric to aggregate across stores at the node level
(e.g. if the node had 3 stores, it'd SUM the store-level timeseries
together and return a single timeseries for the node). Instead, the
feature was only showing the 1st store associated with the node.

This was due to a bug in the code used to determine if a metric was
store-level. A function was used that improperly assumed that the
`cr.node.*` or `cr.store.*` prefix had been stripped from the metric
name, which was not always the case. This led to us improperly
interpret store-level metrics as node-level.

The fix is to fix the logic used to determine if a metric is
store-level.

Additionally, this patch updates the code to no longer aggregate
store-level metrics across each node. Instead, we will now show a single
timeseries per-store to provide finer-grained observability into
store-level metrics within the custom chart tool.

Release note (bug fix): A bug has been fixed in the DB Console's custom
chart tool, where store-level metrics were not being displayed properly.
Previously, if a store-level metric was selected to be displayed at the
store-level on a multi-store node, only data for the 1st store ID
associated with that node would be displayed.

This patch ensures that data is displayed for all stores present on a
node. Additionally, it updates the behavior to show a single timeseries
for each store, as opposed to aggregating (e.g. SUM) all stores across
the node. This allows finer-grained observability into store-level
metrics when using the custom chart tool in DB Console.
@lunevalex lunevalex added the P-2 Issues/test failures with a fix SLA of 3 months label Apr 16, 2024
craig bot pushed a commit that referenced this issue Apr 19, 2024
121380: ui: add static images to asset build step r=laurenbarker a=dhartunian

During the `genassets` build + embed step, we were taking just the output of the `db-console-ccl` or `db-console-oss` step which is just a build.js file. This commit adds references to the image assets we want bundled as well. This includes favicon.ico and everything in `./ assets` relative to the db-console build directory.

We disable content hashing in webpack in order to keep the filenames static, which bazel requires. The impact should be minimal as we rarely change these images so if they're cached forever, it's fine.

This change restores the favicon to the CRDB build and the nice image that shows up in the background of the email signup bar.

The size of the final zipped bundle only differs by around 1MB and is already 10MB in size.

Fixes: #117876
Epic: None

Release note (ui change): the favicon now renders properly for DB Console along with other image files.

122151: ui: make custom chart tool work at store level r=koorosh a=abarganier

Fixes: #121364

This patch fixes a bug in the DB Console custom chart tool, where
selecting the "Per Node" checkbox on a metric would not properly display
store-level metrics. The previous expected behavior was that the check
box would cause the metric to aggregate across stores at the node level
(e.g. if the node had 3 stores, it'd SUM the store-level timeseries
together and return a single timeseries for the node). Instead, the
feature was only showing the 1st store associated with the node.

This was due to a bug in the code used to determine if a metric was
store-level. A function was used that improperly assumed that the
`cr.node.*` or `cr.store.*` prefix had been stripped from the metric
name, which was not always the case. This led to us improperly
interpret store-level metrics as node-level.

The fix is to fix the logic used to determine if a metric is
store-level.

Additionally, this patch updates the code to no longer aggregate
store-level metrics across each node. Instead, we will now show a single
timeseries per-store to provide finer-grained observability into
store-level metrics within the custom chart tool.

Release note (bug fix): A bug has been fixed in the DB Console's custom
chart tool, where store-level metrics were not being displayed properly.
Previously, if a store-level metric was selected to be displayed at the
store-level on a multi-store node, only data for the 1st store ID
associated with that node would be displayed.

This patch ensures that data is displayed for all stores present on a
node. Additionally, it updates the behavior to show a single timeseries
for each store, as opposed to aggregating (e.g. SUM) all stores across
the node. This allows finer-grained observability into store-level
metrics when using the custom chart tool in DB Console.

122539: spanconfigreconcilerccl: use txn descriptor ID generation for test r=rimadeodhar a=rimadeodhar

This PR updates the spanconfigreconciler data driven test to use transactional descriptor ID generation
(#69226) to generate deterministic descriptor IDs. This will help avoid test flakes around changing descriptor IDs due to transaction retries etc.

Epic: none
Fixes: #122343
Release note: None

122557: catalog: add descriptor repair to remove missing roles r=fqazi a=fqazi

Previously, we had a bug that could lead to descriptors having privileages to roles that no longer exist. This could lead to certain commands like SHOW GRANTS breaking. To address this, this patch will add descirptor repair logic to automatically clean up oprhaned privileges.

Fixes: #122552

Release note (bug fix): Add automated clean up / validation for dropped roles inside descriptors.

Co-authored-by: David Hartunian <[email protected]>
Co-authored-by: Alex Barganier <[email protected]>
Co-authored-by: rimadeodhar <[email protected]>
Co-authored-by: Faizan Qazi <[email protected]>
@craig craig bot closed this as completed in c5e95e9 Apr 19, 2024
blathers-crl bot pushed a commit that referenced this issue Apr 19, 2024
Fixes: #121364

This patch fixes a bug in the DB Console custom chart tool, where
selecting the "Per Node" checkbox on a metric would not properly display
store-level metrics. The previous expected behavior was that the check
box would cause the metric to aggregate across stores at the node level
(e.g. if the node had 3 stores, it'd SUM the store-level timeseries
together and return a single timeseries for the node). Instead, the
feature was only showing the 1st store associated with the node.

This was due to a bug in the code used to determine if a metric was
store-level. A function was used that improperly assumed that the
`cr.node.*` or `cr.store.*` prefix had been stripped from the metric
name, which was not always the case. This led to us improperly
interpret store-level metrics as node-level.

The fix is to fix the logic used to determine if a metric is
store-level.

Additionally, this patch updates the code to no longer aggregate
store-level metrics across each node. Instead, we will now show a single
timeseries per-store to provide finer-grained observability into
store-level metrics within the custom chart tool.

Release note (bug fix): A bug has been fixed in the DB Console's custom
chart tool, where store-level metrics were not being displayed properly.
Previously, if a store-level metric was selected to be displayed at the
store-level on a multi-store node, only data for the 1st store ID
associated with that node would be displayed.

This patch ensures that data is displayed for all stores present on a
node. Additionally, it updates the behavior to show a single timeseries
for each store, as opposed to aggregating (e.g. SUM) all stores across
the node. This allows finer-grained observability into store-level
metrics when using the custom chart tool in DB Console.
abarganier added a commit to abarganier/cockroach that referenced this issue Apr 22, 2024
Fixes: cockroachdb#121364

This patch fixes a bug in the DB Console custom chart tool, where
selecting the "Per Node" checkbox on a metric would not properly display
store-level metrics. The previous expected behavior was that the check
box would cause the metric to aggregate across stores at the node level
(e.g. if the node had 3 stores, it'd SUM the store-level timeseries
together and return a single timeseries for the node). Instead, the
feature was only showing the 1st store associated with the node.

This was due to a bug in the code used to determine if a metric was
store-level. A function was used that improperly assumed that the
`cr.node.*` or `cr.store.*` prefix had been stripped from the metric
name, which was not always the case. This led to us improperly
interpret store-level metrics as node-level.

The fix is to fix the logic used to determine if a metric is
store-level.

Additionally, this patch updates the code to no longer aggregate
store-level metrics across each node. Instead, we will now show a single
timeseries per-store to provide finer-grained observability into
store-level metrics within the custom chart tool.

Release note (bug fix): A bug has been fixed in the DB Console's custom
chart tool, where store-level metrics were not being displayed properly.
Previously, if a store-level metric was selected to be displayed at the
store-level on a multi-store node, only data for the 1st store ID
associated with that node would be displayed.

This patch ensures that data is displayed for all stores present on a
node. Additionally, it updates the behavior to show a single timeseries
for each store, as opposed to aggregating (e.g. SUM) all stores across
the node. This allows finer-grained observability into store-level
metrics when using the custom chart tool in DB Console.
abarganier added a commit that referenced this issue Apr 26, 2024
Fixes: #121364

This patch fixes a bug in the DB Console custom chart tool, where
selecting the "Per Node" checkbox on a metric would not properly display
store-level metrics. The previous expected behavior was that the check
box would cause the metric to aggregate across stores at the node level
(e.g. if the node had 3 stores, it'd SUM the store-level timeseries
together and return a single timeseries for the node). Instead, the
feature was only showing the 1st store associated with the node.

This was due to a bug in the code used to determine if a metric was
store-level. A function was used that improperly assumed that the
`cr.node.*` or `cr.store.*` prefix had been stripped from the metric
name, which was not always the case. This led to us improperly
interpret store-level metrics as node-level.

The fix is to fix the logic used to determine if a metric is
store-level.

Additionally, this patch updates the code to no longer aggregate
store-level metrics across each node. Instead, we will now show a single
timeseries per-store to provide finer-grained observability into
store-level metrics within the custom chart tool.

Release note (bug fix): A bug has been fixed in the DB Console's custom
chart tool, where store-level metrics were not being displayed properly.
Previously, if a store-level metric was selected to be displayed at the
store-level on a multi-store node, only data for the 1st store ID
associated with that node would be displayed.

This patch ensures that data is displayed for all stores present on a
node. Additionally, it updates the behavior to show a single timeseries
for each store, as opposed to aggregating (e.g. SUM) all stores across
the node. This allows finer-grained observability into store-level
metrics when using the custom chart tool in DB Console.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-release-24.1 Used to mark GA and release blockers, technical advisories, and bugs for 24.1 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. GA-blocker O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs P-2 Issues/test failures with a fix SLA of 3 months T-observability
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants