Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

catalog: impossible to read metrics names for latencies #64373

Closed
tbg opened this issue Apr 29, 2021 · 1 comment
Closed

catalog: impossible to read metrics names for latencies #64373

tbg opened this issue Apr 29, 2021 · 1 comment
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Comments

@tbg
Copy link
Member

tbg commented Apr 29, 2021

Describe the problem

The timeseries catalog at ./pkg/ts/catalog is supposed to let you read off all of the timeseries names that we maintain:

However, in practice this isn't possible because histograms have an additional level of indirection. For example,

sql.txn.latency is a histogram, but the timeseries this exports are of type sql.txn.latency-p99. As a result,
the (*ts.Server).Dump{Raw} endpoints (hit by ./cockroach debug tsdump) fail to export latency data (as they
don't learn the correct names to look for).

One solution to this could be to leave the catalog alone, and fix

To Reproduce

./cockroach debug tsdump against a local CRDB node and notice that sql.txn.latency (or any other histogram) is missing. Personally I noticed this while working on #64329 as data was missing for the latency graphs.

Expected behavior

The complete timeseries names should be discoverable from the catalog.

The actual names of the metrics are in 1 and are included in the return value of 2 and in 3.

Environment:

  • CockroachDB version master 6b79a99

Additional context

The likely impact of this bug is that when we use #64329 for roachtest metrics visualization purposes, we won't be able to look at any latencies unless we add more hacks a la

// TODO(tbg): these prefixes come from pkg/server/status/recorder.go.
// Unfortunately, the chart catalog does not tell us which ones are per-node
// or per-store, so we simply tag both (one of them will be empty).
names := make([]string, 0, 2*len(m))
for name := range m {
names = append(names, "cr.store."+name, "cr.node."+name)
}
sort.Strings(names)

Jira issue: CRDB-7022

@tbg tbg added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Apr 29, 2021
tbg added a commit to tbg/cockroach that referenced this issue Apr 29, 2021
One of our observability achilles heels has always been that time series
are not included in `debug.zip`. But let's aim lower, they're also not
included in roachtest artifacts. This PR tries to address at least the
latter, and opens up a potential interim solution for the former.

We actually have had the ability to export time series from a cluster
for quite some time, and have recently improved it in cockroachdb#57481. However,
the real problem is visualizing the data. We have something that allows
you to [explore] a dump, but it's unusable until we also do cockroachdb#54178,
which isn't going to happen anytime soon.

For better or worse, in the short term, the most attractive way to get
the data visualized is to dump raw KV pairs into a local cockroachdb
instance and open the DB console.

That's what this commit achieves. Here's a "demo":

```
./cockroach debug tsdump --format raw --insecure \
    --host $(roachprod ip --external tobias-ui: 1) > ts.gob
```

```
COCKROACH_DEBUG_TS_IMPORT_FILE=ts.gob ./cockroach start-single-node --insecure
```

Open the UI and browse:

![image](https://user-images.githubusercontent.com/5076964/116429966-ae470880-a846-11eb-8129-13885bcbab6e.png)

We anticipate trying this out in `roachtest`. It's unclear if it will
ever be included in `debug.zip`, but it seems worthwhile given that
our real solution - the observability server - is a ways out.

There are gotchas.

- the code behind COCKROACH_DEBUG_TS_IMPORT_FILE has to make all kinds of
assumptions about which node each store belongs to and in practice this
means that unless each node has exactly one store, and the IDs match,
the operation will either fail or produce an incorrect mapping,
silently.  What's worse, if this conflicts with the actual running node,
the running node's opinion will win (so things might look at at first,
then flip around, not sure).
- some metrics are missing, see cockroachdb#64373. I think everything except the latency
metrics are there, such as `sql.txn.latency-*`.
- to get the console to even realize these metrics are there, we have to
write fake NodeStatuses. Definitely don't use this on any cluster you
care about.

Given those, the only situation in which I will personally use this is
that of a vanilla roachtest, where NodeIDs and StoreIDs line up nicely.

[dump]: https://github.com/cockroachdb/cockroach/blob/master/scripts/localmetrics/README.md

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue May 7, 2021
One of our observability achilles heels has always been that time series
are not included in `debug.zip`. But let's aim lower, they're also not
included in roachtest artifacts. This PR tries to address at least the
latter, and opens up a potential interim solution for the former.

We actually have had the ability to export time series from a cluster
for quite some time, and have recently improved it in cockroachdb#57481. However,
the real problem is visualizing the data. We have something that allows
you to [explore] a dump, but it's unusable until we also do cockroachdb#54178,
which isn't going to happen anytime soon.

For better or worse, in the short term, the most attractive way to get
the data visualized is to dump raw KV pairs into a local cockroachdb
instance and open the DB console.

That's what this commit achieves. Here's a "demo":

```
./cockroach debug tsdump --format raw --insecure \
    --host $(roachprod ip --external tobias-ui: 1) > ts.gob
```

```
COCKROACH_DEBUG_TS_IMPORT_FILE=ts.gob ./cockroach start-single-node --insecure
```

Open the UI and browse:

![image](https://user-images.githubusercontent.com/5076964/116429966-ae470880-a846-11eb-8129-13885bcbab6e.png)

We anticipate trying this out in `roachtest`. It's unclear if it will
ever be included in `debug.zip`, but it seems worthwhile given that
our real solution - the observability server - is a ways out.

There are gotchas.

- the code behind COCKROACH_DEBUG_TS_IMPORT_FILE has to make all kinds of
assumptions about which node each store belongs to and in practice this
means that unless each node has exactly one store, and the IDs match,
the operation will either fail or produce an incorrect mapping,
silently.  What's worse, if this conflicts with the actual running node,
the running node's opinion will win (so things might look at at first,
then flip around, not sure). We simply limit to `start-single-node` on
the first start, with a single store, to minimize confusion. Any other
configuration will either ignore the var or error out outright.
- some metrics are missing, see cockroachdb#64373. I think everything except the latency
metrics are there, such as `sql.txn.latency-*`.
- to get the console to even realize these metrics are there, we have to
write fake NodeStatuses. Definitely don't use this on any cluster you
care about.

Given those, the only situation in which I will personally use this is
that of a vanilla roachtest, where NodeIDs and StoreIDs line up nicely.

[dump]: https://github.com/cockroachdb/cockroach/blob/master/scripts/localmetrics/README.md

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue May 7, 2021
One of our observability achilles heels has always been that time series
are not included in `debug.zip`. But let's aim lower, they're also not
included in roachtest artifacts. This PR tries to address at least the
latter, and opens up a potential interim solution for the former.

We actually have had the ability to export time series from a cluster
for quite some time, and have recently improved it in cockroachdb#57481. However,
the real problem is visualizing the data. We have something that allows
you to [explore] a dump, but it's unusable until we also do cockroachdb#54178,
which isn't going to happen anytime soon.

For better or worse, in the short term, the most attractive way to get
the data visualized is to dump raw KV pairs into a local cockroachdb
instance and open the DB console.

That's what this commit achieves. Here's a "demo":

```
./cockroach debug tsdump --format raw --insecure \
    --host $(roachprod ip --external tobias-ui: 1) > ts.gob
```

```
COCKROACH_DEBUG_TS_IMPORT_FILE=ts.gob ./cockroach start-single-node --insecure
```

Open the UI and browse:

![image](https://user-images.githubusercontent.com/5076964/116429966-ae470880-a846-11eb-8129-13885bcbab6e.png)

We anticipate trying this out in `roachtest`. It's unclear if it will
ever be included in `debug.zip`, but it seems worthwhile given that
our real solution - the observability server - is a ways out.

There are gotchas.

- the code behind COCKROACH_DEBUG_TS_IMPORT_FILE has to make all kinds of
assumptions about which node each store belongs to and in practice this
means that unless each node has exactly one store, and the IDs match,
the operation will either fail or produce an incorrect mapping,
silently.  What's worse, if this conflicts with the actual running node,
the running node's opinion will win (so things might look at at first,
then flip around, not sure). We simply limit to `start-single-node` on
the first start, with a single store, to minimize confusion. Any other
configuration will either ignore the var or error out outright.
- some metrics are missing, see cockroachdb#64373. I think everything except the latency
metrics are there, such as `sql.txn.latency-*`.
- to get the console to even realize these metrics are there, we have to
write fake NodeStatuses. Definitely don't use this on any cluster you
care about.

Given those, the only situation in which I will personally use this is
that of a vanilla roachtest, where NodeIDs and StoreIDs line up nicely.

[dump]: https://github.com/cockroachdb/cockroach/blob/master/scripts/localmetrics/README.md

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue May 8, 2021
One of our observability achilles heels has always been that time series
are not included in `debug.zip`. But let's aim lower, they're also not
included in roachtest artifacts. This PR tries to address at least the
latter, and opens up a potential interim solution for the former.

We actually have had the ability to export time series from a cluster
for quite some time, and have recently improved it in cockroachdb#57481. However,
the real problem is visualizing the data. We have something that allows
you to [explore] a dump, but it's unusable until we also do cockroachdb#54178,
which isn't going to happen anytime soon.

For better or worse, in the short term, the most attractive way to get
the data visualized is to dump raw KV pairs into a local cockroachdb
instance and open the DB console.

That's what this commit achieves. Here's a "demo":

```
./cockroach debug tsdump --format raw --insecure \
    --host $(roachprod ip --external tobias-ui: 1) > ts.gob
```

```
COCKROACH_DEBUG_TS_IMPORT_FILE=ts.gob ./cockroach start-single-node --insecure
```

Open the UI and browse:

![image](https://user-images.githubusercontent.com/5076964/116429966-ae470880-a846-11eb-8129-13885bcbab6e.png)

We anticipate trying this out in `roachtest`. It's unclear if it will
ever be included in `debug.zip`, but it seems worthwhile given that
our real solution - the observability server - is a ways out.

There are gotchas.

- the code behind COCKROACH_DEBUG_TS_IMPORT_FILE has to make all kinds of
assumptions about which node each store belongs to and in practice this
means that unless each node has exactly one store, and the IDs match,
the operation will either fail or produce an incorrect mapping,
silently.  What's worse, if this conflicts with the actual running node,
the running node's opinion will win (so things might look at at first,
then flip around, not sure). We simply limit to `start-single-node` on
the first start, with a single store, to minimize confusion. Any other
configuration will either ignore the var or error out outright.
- some metrics are missing, see cockroachdb#64373. I think everything except the latency
metrics are there, such as `sql.txn.latency-*`.
- to get the console to even realize these metrics are there, we have to
write fake NodeStatuses. Definitely don't use this on any cluster you
care about.

Given those, the only situation in which I will personally use this is
that of a vanilla roachtest, where NodeIDs and StoreIDs line up nicely.

[dump]: https://github.com/cockroachdb/cockroach/blob/master/scripts/localmetrics/README.md

Release note: None
craig bot pushed a commit that referenced this issue May 8, 2021
64329: cli,ts: allow (hacky) visualization of timeseries dumps r=knz a=tbg

One of our observability achilles heels has always been that time series
are not included in `debug.zip`. But let's aim lower, they're also not
included in roachtest artifacts. This PR tries to address at least the
latter, and opens up a potential interim solution for the former.

We actually have had the ability to export time series from a cluster
for quite some time, and have recently improved it in #57481. However,
the real problem is visualizing the data. We have something that allows
you to [explore] a dump, but it's unusable until we also do #54178,
which isn't going to happen anytime soon.

For better or worse, in the short term, the most attractive way to get
the data visualized is to dump raw KV pairs into a local cockroachdb
instance and open the DB console.

That's what this commit achieves. Here's a "demo":

```
./cockroach debug tsdump --format raw --insecure \
    --host $(roachprod ip --external tobias-ui: 1) > ts.gob
```

```
COCKROACH_DEBUG_TS_IMPORT_FILE=ts.gob ./cockroach start-single-node --insecure
```

Open the UI and browse:

![image](https://user-images.githubusercontent.com/5076964/116429966-ae470880-a846-11eb-8129-13885bcbab6e.png)

We anticipate trying this out in `roachtest`. It's unclear if it will
ever be included in `debug.zip`, but it seems worthwhile given that
our real solution - the observability server - is a ways out.

There are gotchas.

- the code behind COCKROACH_DEBUG_TS_IMPORT_FILE has to make all kinds of
assumptions about which node each store belongs to and in practice this
means that unless each node has exactly one store, and the IDs match,
the operation will either fail or produce an incorrect mapping,
silently.  What's worse, if this conflicts with the actual running node,
the running node's opinion will win (so things might look at at first,
then flip around, not sure).
- some metrics are missing, see #64373. I think everything except the latency
metrics are there, such as `sql.txn.latency-*`.
- to get the console to even realize these metrics are there, we have to
write fake NodeStatuses. Definitely don't use this on any cluster you
care about.

Given those, the only situation in which I will personally use this is
that of a vanilla roachtest, where NodeIDs and StoreIDs line up nicely.

[explore]: https://github.com/cockroachdb/cockroach/blob/master/scripts/localmetrics/README.md

Release note: None

Co-authored-by: Tobias Grieger <[email protected]>
tbg added a commit to tbg/cockroach that referenced this issue Jun 23, 2021
One of our observability achilles heels has always been that time series
are not included in `debug.zip`. But let's aim lower, they're also not
included in roachtest artifacts. This PR tries to address at least the
latter, and opens up a potential interim solution for the former.

We actually have had the ability to export time series from a cluster
for quite some time, and have recently improved it in cockroachdb#57481. However,
the real problem is visualizing the data. We have something that allows
you to [explore] a dump, but it's unusable until we also do cockroachdb#54178,
which isn't going to happen anytime soon.

For better or worse, in the short term, the most attractive way to get
the data visualized is to dump raw KV pairs into a local cockroachdb
instance and open the DB console.

That's what this commit achieves. Here's a "demo":

```
./cockroach debug tsdump --format raw --insecure \
    --host $(roachprod ip --external tobias-ui: 1) > ts.gob
```

```
COCKROACH_DEBUG_TS_IMPORT_FILE=ts.gob ./cockroach start-single-node --insecure
```

Open the UI and browse:

![image](https://user-images.githubusercontent.com/5076964/116429966-ae470880-a846-11eb-8129-13885bcbab6e.png)

We anticipate trying this out in `roachtest`. It's unclear if it will
ever be included in `debug.zip`, but it seems worthwhile given that
our real solution - the observability server - is a ways out.

There are gotchas.

- the code behind COCKROACH_DEBUG_TS_IMPORT_FILE has to make all kinds of
assumptions about which node each store belongs to and in practice this
means that unless each node has exactly one store, and the IDs match,
the operation will either fail or produce an incorrect mapping,
silently.  What's worse, if this conflicts with the actual running node,
the running node's opinion will win (so things might look at at first,
then flip around, not sure). We simply limit to `start-single-node` on
the first start, with a single store, to minimize confusion. Any other
configuration will either ignore the var or error out outright.
- some metrics are missing, see cockroachdb#64373. I think everything except the latency
metrics are there, such as `sql.txn.latency-*`.
- to get the console to even realize these metrics are there, we have to
write fake NodeStatuses. Definitely don't use this on any cluster you
care about.

Given those, the only situation in which I will personally use this is
that of a vanilla roachtest, where NodeIDs and StoreIDs line up nicely.

[dump]: https://github.com/cockroachdb/cockroach/blob/master/scripts/localmetrics/README.md

Release note: None
@tbg
Copy link
Member Author

tbg commented Jun 1, 2022

I added a hack in #69469 a while back.

// The histogramMetricsNames variable below was originally seeded using the invocation
// below. It's kept up to date via `server.TestChartCatalogMetrics`.
var _ = `
./cockroach demo --empty -e \
"select name from crdb_internal.node_metrics where name like '%-p50'" | \
sed -E 's/^(.*)-p50$/"\1": {},/'
`
var histogramMetricsNames = map[string]struct{}{
"sql.txn.latency.internal": {},
"sql.conn.latency": {},
"sql.mem.sql.session.max": {},
"sql.stats.flush.duration": {},
"changefeed.checkpoint_hist_nanos": {},
"admission.wait_durations.sql-sql-response": {},
"admission.wait_durations.sql-kv-response": {},
"sql.exec.latency": {},
"sql.stats.mem.max.internal": {},
"admission.wait_durations.sql-leaf-start": {},
"sql.disk.distsql.max": {},
"txn.restarts": {},
"sql.stats.flush.duration.internal": {},
"sql.distsql.exec.latency": {},
"sql.mem.internal.txn.max": {},
"changefeed.emit_hist_nanos": {},
"changefeed.flush_hist_nanos": {},
"sql.service.latency": {},
"round-trip-latency": {},
"admission.wait_durations.kv": {},
"sql.mem.distsql.max": {},
"kv.prober.write.latency": {},
"exec.latency": {},
"admission.wait_durations.sql-root-start": {},
"sql.mem.bulk.max": {},
"sql.distsql.flows.queue_wait": {},
"sql.txn.latency": {},
"sql.mem.root.max": {},
"admission.wait_durations.kv-stores": {},
"sql.stats.mem.max": {},
"sql.distsql.service.latency.internal": {},
"sql.stats.reported.mem.max.internal": {},
"sql.stats.reported.mem.max": {},
"sql.exec.latency.internal": {},
"sql.mem.internal.session.max": {},
"sql.distsql.exec.latency.internal": {},
"kv.prober.read.latency": {},
"sql.distsql.service.latency": {},
"sql.service.latency.internal": {},
"sql.mem.sql.txn.max": {},
"liveness.heartbeatlatency": {},
"txn.durations": {},
"raft.process.handleready.latency": {},
"raft.process.commandcommit.latency": {},
"raft.process.logcommit.latency": {},
"raft.quota_pool.percent_used": {},
"raft.scheduler.latency": {},
"txnwaitqueue.pusher.wait_time": {},
"txnwaitqueue.query.wait_time": {},
"raft.process.applycommitted.latency": {},
"sql.stats.txn_stats_collection.duration": {},
"rebalancing.l0_sublevels_histogram": {},
}

It's ugly, but gets the job done.

@tbg tbg closed this as completed Jun 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
Projects
None yet
Development

No branches or pull requests

4 participants