Expose network metrics in `system_metric` API #3386

askfongjojo · 2023-06-20T21:19:28Z

The system_metric API endpoint currently exposes only the metrics under the target named collection_target, i.e. virtual_disk_space_provisioned, cpus_provisioned, and ram_provisioned. It would be good to loosen this restriction so that the same API can be used to query other system-level metrics, e.g. the recently added networking metrics for data_link, without having to update the API every time something new is added.
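
To illustrate the kind of loosening meant here, below is a minimal Python sketch in which access is decided by target name rather than by a hard-coded list of metrics. The names `ALLOWED_TARGETS` and `is_queryable` are hypothetical and are not taken from the actual API implementation; the only assumption drawn from this issue is that timeseries names have the form `target:metric`.

```python
# Hypothetical allowlist of targets; any metric under these targets is queryable,
# so a new metric added to an allowed target needs no API change.
ALLOWED_TARGETS = {"collection_target", "data_link"}


def is_queryable(timeseries_name, allowed_targets=ALLOWED_TARGETS):
    """Return True if the name is 'target:metric' and the target is allowed."""
    target, sep, metric = timeseries_name.partition(":")
    # Reject names with no separator or with an empty metric part.
    return bool(sep) and bool(metric) and target in allowed_targets


if __name__ == "__main__":
    for name in ("data_link:rx_bytes", "crucible_upstairs:flush"):
        print(name, is_queryable(name))
```

With this shape, a new metric such as another data_link counter would pass the check without touching the allowlist.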

The main goal is to allow easier access to the metrics when engineers work with customers to get more debugging data.

We might also want to bring back the timeseries_schema endpoint (it was taken out some time back) so that users can see what metrics are available. I don't recall what the response looked like when it was there, but it was probably something like this:

root@oxz_clickhouse_oxp_3822d3f1-3a2f-43ae-afd8-3ec2fa0cfa54:~# echo 'select * from timeseries_schema format CSVWithNames' | curl 'http://[fd00:1122:3344:104::7]:8123/?database=oximeter' --data-binary @-

"timeseries_name","fields.name","fields.type","fields.source","datum_type","created"
"collection_target:cpus_provisioned","['id']","['Uuid']","['Target']","I64","2023-06-19 01:18:48.882423329"
"collection_target:ram_provisioned","['id']","['Uuid']","['Target']","I64","2023-06-19 01:18:48.882476509"
"collection_target:virtual_disk_space_provisioned","['id']","['Uuid']","['Target']","I64","2023-06-19 01:18:48.882523457"
"crucible_upstairs:activated","['upstairs_uuid']","['Uuid']","['Target']","CumulativeI64","2023-06-19 05:21:34.172287209"
"crucible_upstairs:extent_no_op","['upstairs_uuid']","['Uuid']","['Target']","CumulativeI64","2023-06-19 05:21:34.172734541"
"crucible_upstairs:extent_reopen","['upstairs_uuid']","['Uuid']","['Target']","CumulativeI64","2023-06-19 05:21:34.172785838"
"crucible_upstairs:extent_repair","['upstairs_uuid']","['Uuid']","['Target']","CumulativeI64","2023-06-19 05:21:34.172683886"
"crucible_upstairs:flush","['upstairs_uuid']","['Uuid']","['Target']","CumulativeI64","2023-06-19 05:21:34.172359255"
"crucible_upstairs:flush_close","['upstairs_uuid']","['Uuid']","['Target']","CumulativeI64","2023-06-19 05:21:34.172632539"
"crucible_upstairs:read","['upstairs_uuid']","['Uuid']","['Target']","CumulativeI64","2023-06-19 05:21:34.172529255"
"crucible_upstairs:read_bytes","['upstairs_uuid']","['Uuid']","['Target']","CumulativeI64","2023-06-19 05:21:34.172581173"
"crucible_upstairs:write","['upstairs_uuid']","['Uuid']","['Target']","CumulativeI64","2023-06-19 05:21:34.172415912"
"crucible_upstairs:write_bytes","['upstairs_uuid']","['Uuid']","['Target']","CumulativeI64","2023-06-19 05:21:34.172473410"
"data_link:bad_sync_headers","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","CumulativeI64","2023-06-19 01:17:53.316956038"
"data_link:enabled","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","Bool","2023-06-19 01:17:53.314880592"
"data_link:errored_blocks","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","CumulativeI64","2023-06-19 01:17:53.317020159"
"data_link:fec_align","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","Bool","2023-06-19 01:17:53.316220494"
"data_link:fec_corr_cnt","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","CumulativeI64","2023-06-19 01:17:53.316280857"
"data_link:fec_hi_ser","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","Bool","2023-06-19 01:17:53.316160000"
"data_link:fec_ser_lane0","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","CumulativeI64","2023-06-19 01:17:53.316407245"
"data_link:fec_ser_lane1","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","CumulativeI64","2023-06-19 01:17:53.316476075"
"data_link:fec_ser_lane2","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","CumulativeI64","2023-06-19 01:17:53.316541869"
"data_link:fec_ser_lane3","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","CumulativeI64","2023-06-19 01:17:53.316604256"
"data_link:fec_ser_lane4","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","CumulativeI64","2023-06-19 01:17:53.316678125"
"data_link:fec_ser_lane5","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","CumulativeI64","2023-06-19 01:17:53.316740433"
"data_link:fec_ser_lane6","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","CumulativeI64","2023-06-19 01:17:53.316812979"
"data_link:fec_ser_lane7","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","CumulativeI64","2023-06-19 01:17:53.316885396"
"data_link:fec_uncorr_cnt","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","CumulativeI64","2023-06-19 01:17:53.316342293"
"data_link:link_up","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","Bool","2023-06-19 01:17:53.315518873"
"data_link:pci_hi_ber","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","CumulativeI64","2023-06-19 01:17:53.317204025"
"data_link:pcs_block_lock_loss","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","CumulativeI64","2023-06-19 01:17:53.317142079"
"data_link:pcs_invalid_errors","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","CumulativeI64","2023-06-19 01:17:53.317391939"
"data_link:pcs_sync_loss","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","CumulativeI64","2023-06-19 01:17:53.317079721"
"data_link:pcs_unknown_errors","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","CumulativeI64","2023-06-19 01:17:53.317326526"
"data_link:pcs_valid_errors","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","CumulativeI64","2023-06-19 01:17:53.317264690"
"data_link:rx_buf_full","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","CumulativeI64","2023-06-19 01:17:53.315848183"
"data_link:rx_bytes","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","CumulativeI64","2023-06-19 01:17:53.315705314"
"data_link:rx_crc_errs","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","CumulativeI64","2023-06-19 01:17:53.315772831"
"data_link:rx_errs","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","CumulativeI64","2023-06-19 01:17:53.315913296"
"data_link:rx_pkts","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","CumulativeI64","2023-06-19 01:17:53.315621086"
"data_link:tx_bytes","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","CumulativeI64","2023-06-19 01:17:53.316037239"
"data_link:tx_errs","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","CumulativeI64","2023-06-19 01:17:53.316099486"
"data_link:tx_pkts","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","CumulativeI64","2023-06-19 01:17:53.315975162"
"http_service:request_latency_histogram","['name','id','route','method','status_code']","['String','Uuid','String','String','I64']","['Target','Target','Metric','Metric','Metric']","HistogramF64","2023-06-19 01:18:48.869919693"
"instance_uuid:reset","['uuid']","['Uuid']","['Target']","CumulativeI64","2023-06-19 05:21:34.172134582"
"sidecar:sample_time","['rack_id','sled_id','sidecar_id','board_rev']","['Uuid','Uuid','Uuid','String']","['Target','Target','Target','Target']","I64","2023-06-19 01:17:53.363000801"
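
As a rough illustration of how a client might consume a schema listing like the one above, here is a minimal Python sketch that parses rows in this CSV shape and groups metric names by target. It assumes the array columns are rendered as Python-compatible list literals, as they are in the output above. The `Field` and `TimeseriesSchema` classes and both functions are hypothetical names for this example only, not part of oximeter or the API.

```python
import ast
import csv
import io
from dataclasses import dataclass


@dataclass
class Field:
    name: str
    type: str
    source: str  # "Target" or "Metric" in the output above


@dataclass
class TimeseriesSchema:
    target: str
    metric: str
    fields: list
    datum_type: str


def parse_schema_csv(text):
    """Parse CSVWithNames output of the timeseries_schema table."""
    schemas = []
    for row in csv.DictReader(io.StringIO(text)):
        # Timeseries names have the form "target:metric".
        target, metric = row["timeseries_name"].split(":", 1)
        # Array columns look like "['a','b']", which is a valid Python literal.
        names = ast.literal_eval(row["fields.name"])
        types = ast.literal_eval(row["fields.type"])
        sources = ast.literal_eval(row["fields.source"])
        fields = [Field(n, t, s) for n, t, s in zip(names, types, sources)]
        schemas.append(TimeseriesSchema(target, metric, fields, row["datum_type"]))
    return schemas


def metrics_by_target(schemas):
    """Group metric names under their target name."""
    grouped = {}
    for schema in schemas:
        grouped.setdefault(schema.target, []).append(schema.metric)
    return grouped


# Two rows copied from the output above.
SAMPLE = '''"timeseries_name","fields.name","fields.type","fields.source","datum_type","created"
"collection_target:cpus_provisioned","['id']","['Uuid']","['Target']","I64","2023-06-19 01:18:48.882423329"
"data_link:enabled","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","Bool","2023-06-19 01:17:53.314880592"
'''

if __name__ == "__main__":
    print(metrics_by_target(parse_schema_csv(SAMPLE)))
```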

cc @david-crespo


bnaecker · 2023-06-20T21:32:14Z

Part of the issue here is that I don't believe we've settled on a good story for exposing metrics in general. The disk-related endpoints are one potential path, where we create a specific endpoint for each of the metrics we'd like to expose. On the other end of the spectrum, one could imagine a pretty general way to query all the metrics we have; a strawman would be something like providing full SQL queries for selecting data from the timeseries. Those two have lots of tradeoffs, some of which were discussed in RFD 304. There was no general resolution, and without someone to focus on this it was hard to drive a prototype forward.

david-crespo · 2023-06-20T21:42:51Z

I am inclined to keep punting on a per-metric basis until we can get someone working on this problem full-time. Seeing how we need to query these metrics for our own purposes is useful input for designing the better system.

askfongjojo · 2023-06-21T00:06:38Z

@bnaecker - Thanks for pointing me back to RFD 304. I've read it previously but definitely failed to recall some of the concerns raised there when filing this ticket.

This may be one of the cases that can benefit from having an early version of the metrics API classified as "experimental". I think there is value in continuing to expand the /system/metrics endpoint to give ourselves more support tools in the field while limiting API access to super users. Querying based on the object uuid (without any higher-level abstraction or joins) is hopefully the least expensive way to expose the data. The burden of linking the uuid back to human-readable identifiers or parent objects will fall on the engineer consuming the metrics. Once we gain more support experience, we can come up with the appropriate abstractions for the future customer-facing API.
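
To make the "burden on the engineer" concrete, here is a minimal Python sketch of the client-side step: samples come back keyed only by uuid, and the consumer attaches names from a separately obtained inventory. The sample shape, the `inventory` mapping, and `label_samples` are all hypothetical and do not reflect the actual response format.

```python
def label_samples(samples, inventory):
    """Attach a human-readable name to each sample, looked up by object uuid.

    `samples` is a list of dicts with an "id" key (hypothetical shape);
    `inventory` maps uuid strings to names the engineer gathered elsewhere.
    """
    labeled = []
    for sample in samples:
        name = inventory.get(sample["id"], "<unknown>")
        # Copy the sample so the caller's data is not mutated.
        labeled.append({**sample, "name": name})
    return labeled


if __name__ == "__main__":
    samples = [{"id": "3822d3f1", "value": 42}, {"id": "ffffffff", "value": 7}]
    inventory = {"3822d3f1": "sled-a"}  # made-up identifiers for illustration
    for row in label_samples(samples, inventory):
        print(row)
```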

I just think that it's worse to keep metrics in limbo till we know what the customer needs. With rack-level objects not owned by end-users (e.g. datalinks, sleds, physical disks), we ourselves are going to be the first customer.

@david-crespo - Given the pending design decisions, a per-metric implementation sounds good. If the implementation turns out to be more costly than expected, please bring it up for discussion.

bnaecker · 2023-06-21T00:20:38Z

I think this might be orthogonal to what you're proposing, but if the goal is for Oxide developers to consume metrics, then you can currently read any data available in ClickHouse with the oxdb CLI tool in the oximeter_db crate. That is definitely rough, but I found it helpful when building the initial system. Also, that's entirely outside the console, so if that's the end goal here, oxdb is definitely a tool for a different audience.

Is that what you're looking for? Or is the goal definitely to expose things in the API and console?

askfongjojo · 2023-06-21T00:37:33Z

I don't intend to have more metrics exposed in the console. It sounds like the oxdb CLI already serves the purpose. So maybe we can document it as a support tool (e.g. under https://github.com/oxidecomputer/meta/tree/master/engineering/rack-support) and punt the API side of things.

askfongjojo · 2024-04-03T19:22:23Z

Implemented in #5273

askfongjojo added this to the MVP milestone Jun 20, 2023

smklein added the Metrics label Oct 12, 2023

askfongjojo closed this as completed Apr 3, 2024
