Expose network metrics in system_metric API #3386

Closed
askfongjojo opened this issue Jun 20, 2023 · 6 comments

@askfongjojo

The API endpoint currently exposes only metrics under the target named collection_target, i.e. virtual_disk_space_provisioned, cpus_provisioned, ram_provisioned. It'll be good to loosen the restriction so that the same API can be used for querying other system-level metrics, e.g. the recently added networking metrics for data_link, without the need to update the API every time something new is added.

The main goal is to allow easier access to the metrics when engineers work with customers to get more debugging data.

We might also want to bring back the timeseries_schema endpoint (it was taken out some time back) so that users can see what metrics are available. I don't recall what the response looked like when it was there, but it was probably something like this:

root@oxz_clickhouse_oxp_3822d3f1-3a2f-43ae-afd8-3ec2fa0cfa54:~# echo 'select * from timeseries_schema format CSVWithNames' | curl 'http://[fd00:1122:3344:104::7]:8123/?database=oximeter' --data-binary @-
"timeseries_name","fields.name","fields.type","fields.source","datum_type","created"
"collection_target:cpus_provisioned","['id']","['Uuid']","['Target']","I64","2023-06-19 01:18:48.882423329"
"collection_target:ram_provisioned","['id']","['Uuid']","['Target']","I64","2023-06-19 01:18:48.882476509"
"collection_target:virtual_disk_space_provisioned","['id']","['Uuid']","['Target']","I64","2023-06-19 01:18:48.882523457"
"crucible_upstairs:activated","['upstairs_uuid']","['Uuid']","['Target']","CumulativeI64","2023-06-19 05:21:34.172287209"
"crucible_upstairs:extent_no_op","['upstairs_uuid']","['Uuid']","['Target']","CumulativeI64","2023-06-19 05:21:34.172734541"
"crucible_upstairs:extent_reopen","['upstairs_uuid']","['Uuid']","['Target']","CumulativeI64","2023-06-19 05:21:34.172785838"
"crucible_upstairs:extent_repair","['upstairs_uuid']","['Uuid']","['Target']","CumulativeI64","2023-06-19 05:21:34.172683886"
"crucible_upstairs:flush","['upstairs_uuid']","['Uuid']","['Target']","CumulativeI64","2023-06-19 05:21:34.172359255"
"crucible_upstairs:flush_close","['upstairs_uuid']","['Uuid']","['Target']","CumulativeI64","2023-06-19 05:21:34.172632539"
"crucible_upstairs:read","['upstairs_uuid']","['Uuid']","['Target']","CumulativeI64","2023-06-19 05:21:34.172529255"
"crucible_upstairs:read_bytes","['upstairs_uuid']","['Uuid']","['Target']","CumulativeI64","2023-06-19 05:21:34.172581173"
"crucible_upstairs:write","['upstairs_uuid']","['Uuid']","['Target']","CumulativeI64","2023-06-19 05:21:34.172415912"
"crucible_upstairs:write_bytes","['upstairs_uuid']","['Uuid']","['Target']","CumulativeI64","2023-06-19 05:21:34.172473410"
"data_link:bad_sync_headers","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","CumulativeI64","2023-06-19 01:17:53.316956038"
"data_link:enabled","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","Bool","2023-06-19 01:17:53.314880592"
"data_link:errored_blocks","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","CumulativeI64","2023-06-19 01:17:53.317020159"
"data_link:fec_align","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","Bool","2023-06-19 01:17:53.316220494"
"data_link:fec_corr_cnt","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","CumulativeI64","2023-06-19 01:17:53.316280857"
"data_link:fec_hi_ser","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","Bool","2023-06-19 01:17:53.316160000"
"data_link:fec_ser_lane0","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","CumulativeI64","2023-06-19 01:17:53.316407245"
"data_link:fec_ser_lane1","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","CumulativeI64","2023-06-19 01:17:53.316476075"
"data_link:fec_ser_lane2","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","CumulativeI64","2023-06-19 01:17:53.316541869"
"data_link:fec_ser_lane3","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","CumulativeI64","2023-06-19 01:17:53.316604256"
"data_link:fec_ser_lane4","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","CumulativeI64","2023-06-19 01:17:53.316678125"
"data_link:fec_ser_lane5","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","CumulativeI64","2023-06-19 01:17:53.316740433"
"data_link:fec_ser_lane6","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","CumulativeI64","2023-06-19 01:17:53.316812979"
"data_link:fec_ser_lane7","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","CumulativeI64","2023-06-19 01:17:53.316885396"
"data_link:fec_uncorr_cnt","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","CumulativeI64","2023-06-19 01:17:53.316342293"
"data_link:link_up","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","Bool","2023-06-19 01:17:53.315518873"
"data_link:pci_hi_ber","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","CumulativeI64","2023-06-19 01:17:53.317204025"
"data_link:pcs_block_lock_loss","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","CumulativeI64","2023-06-19 01:17:53.317142079"
"data_link:pcs_invalid_errors","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","CumulativeI64","2023-06-19 01:17:53.317391939"
"data_link:pcs_sync_loss","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","CumulativeI64","2023-06-19 01:17:53.317079721"
"data_link:pcs_unknown_errors","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","CumulativeI64","2023-06-19 01:17:53.317326526"
"data_link:pcs_valid_errors","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","CumulativeI64","2023-06-19 01:17:53.317264690"
"data_link:rx_buf_full","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","CumulativeI64","2023-06-19 01:17:53.315848183"
"data_link:rx_bytes","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","CumulativeI64","2023-06-19 01:17:53.315705314"
"data_link:rx_crc_errs","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","CumulativeI64","2023-06-19 01:17:53.315772831"
"data_link:rx_errs","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","CumulativeI64","2023-06-19 01:17:53.315913296"
"data_link:rx_pkts","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","CumulativeI64","2023-06-19 01:17:53.315621086"
"data_link:tx_bytes","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","CumulativeI64","2023-06-19 01:17:53.316037239"
"data_link:tx_errs","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","CumulativeI64","2023-06-19 01:17:53.316099486"
"data_link:tx_pkts","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","CumulativeI64","2023-06-19 01:17:53.315975162"
"http_service:request_latency_histogram","['name','id','route','method','status_code']","['String','Uuid','String','String','I64']","['Target','Target','Metric','Metric','Metric']","HistogramF64","2023-06-19 01:18:48.869919693"
"instance_uuid:reset","['uuid']","['Uuid']","['Target']","CumulativeI64","2023-06-19 05:21:34.172134582"
"sidecar:sample_time","['rack_id','sled_id','sidecar_id','board_rev']","['Uuid','Uuid','Uuid','String']","['Target','Target','Target','Target']","I64","2023-06-19 01:17:53.363000801"
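
For anyone who wants to turn that CSV dump into a quick "what metrics exist per target" listing, here is a minimal sketch. It assumes the output format shown above (CSVWithNames, with the `fields.*` columns rendered as single-quoted array literals); the function name `parse_schema` and the shape of the returned dictionary are made up for illustration and are not part of oximeter or the API.

```python
import ast
import csv
import io
from collections import defaultdict

# A few rows copied from the output above, used only as sample input.
SAMPLE = '''"timeseries_name","fields.name","fields.type","fields.source","datum_type","created"
"collection_target:cpus_provisioned","['id']","['Uuid']","['Target']","I64","2023-06-19 01:18:48.882423329"
"data_link:enabled","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","Bool","2023-06-19 01:17:53.314880592"
"data_link:rx_bytes","['rack_id','sled_id','sidecar_id','port_id','link_id']","['Uuid','Uuid','Uuid','String','I64']","['Target','Target','Target','Target','Target']","CumulativeI64","2023-06-19 01:17:53.315705314"
'''


def parse_schema(text):
    """Group timeseries_schema rows by target name (hypothetical helper).

    Returns {target: {metric: {"fields": [(name, type, source), ...],
                               "datum_type": str}}}.
    """
    targets = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(text)):
        # Timeseries names have the form "<target>:<metric>".
        target, _, metric = row["timeseries_name"].partition(":")
        # The array columns look like Python list literals, e.g. "['id']".
        names = ast.literal_eval(row["fields.name"])
        types = ast.literal_eval(row["fields.type"])
        sources = ast.literal_eval(row["fields.source"])
        targets[target][metric] = {
            "fields": list(zip(names, types, sources)),
            "datum_type": row["datum_type"],
        }
    return dict(targets)


if __name__ == "__main__":
    for target, metrics in sorted(parse_schema(SAMPLE).items()):
        print(target, sorted(metrics))
```

Feeding it the full curl output instead of `SAMPLE` should list every metric under `data_link`, `crucible_upstairs`, and so on, together with the fields one would need to filter on.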

cc @david-crespo

@askfongjojo askfongjojo added this to the MVP milestone Jun 20, 2023
@bnaecker
Collaborator

Part of the issue here is that I don't believe we've settled on a good story for exposing metrics in general. The disk-related endpoints are one potential path, where we create a specific endpoint for each of the metrics we'd like to expose. On the other end of the spectrum, one could imagine a pretty general way to query all the metrics we have; a strawman would be something like providing full SQL queries for selecting data from the timeseries. Those two approaches have lots of tradeoffs, some of which were discussed in RFD 304. There was no general resolution, and without someone to focus on this it was hard to drive a prototype forward.

@david-crespo
Contributor

david-crespo commented Jun 20, 2023

I am inclined to keep punting on a per-metric basis until we can get someone working on this problem full-time. Seeing how we need to query these metrics for our own purposes is useful input for designing the better system.

@askfongjojo
Author

askfongjojo commented Jun 21, 2023

@bnaecker - Thanks for pointing me back to RFD 304. I've read it previously but definitely failed to recall some of the concerns raised there when filing this ticket.

This may be one of the cases that can benefit from having an early version of the metrics API classified as "experimental". I think there is value in continuing to expand the /system/metrics endpoint to give ourselves more support tools in the field, while limiting API access to super users. Querying based on the object uuid (without any higher-level abstraction or joins) is hopefully the least expensive way to expose the data. The burden of linking the uuid back to human-readable identifiers or parent objects will fall on the engineer consuming the metrics. Once we gain more support experience, we can come up with the appropriate abstractions for the future customer-facing API.

I just think that it's worse to keep metrics in limbo till we know what the customer needs. With rack-level objects not owned by end-users (e.g. datalinks, sleds, physical disks), we ourselves are going to be the first customer.

@david-crespo - Given the pending design decisions, a per-metric implementation sounds good. If the implementation turns out to be more costly than expected, please bring it up for discussion.

@bnaecker
Collaborator

I think this might be orthogonal to what you're proposing, but if the goal is for Oxide developers to consume metrics, then you can currently read any data available in ClickHouse with the oxdb CLI tool in the oximeter_db crate. That is definitely rough, but I found it helpful when building the initial system. Also, that's entirely outside the console, so if that's the end goal here, oxdb is definitely a tool for a different audience.

Is that what you're looking for? Or is the goal definitely to expose things in the API and console?

@askfongjojo
Author

I don't intend to have more metrics exposed in the console. It sounds like the oxdb CLI already serves the purpose. So maybe we can document it as a support tool (e.g. under https://github.com/oxidecomputer/meta/tree/master/engineering/rack-support) and punt the API side of things.

@askfongjojo
Author

Implemented in #5273
