Support missing samples in oximeter #4552

Merged Dec 4, 2023 (2 commits)

Conversation

bnaecker
Collaborator

  • Add a Datum::Missing and MissingDatum, which records the intended datum type and an optional start time for a sample which could not be produced.
  • Database upgrades which make all scalar datum columns Nullable. Array fields are not made Nullable, since ClickHouse doesn't support composite types like arrays inside a Nullable wrapper type. The empty array is used as a sentinel, which is OK since we can't have zero-length array histograms in Oximeter. Add a test which will fail if we ever change that.
  • Rework database serialization to handle Nullable types or empty arrays. This uses a new helper trait to convert a NULL (which has no type information) to the intended datum type, or an empty array to a histogram.
  • Add a test for each measurement type that we can recover a missing sample of that type -- NULLs for scalar values and empty arrays for histograms.
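As a rough illustration of the pieces listed above, here is a minimal sketch of a datum enum with a missing variant, plus a helper trait that recovers the intended type from a NULL. It is not the code in this PR: the trait name `FromNullable`, the function `histogram_from_arrays`, the reduced set of variants, and the use of a plain `u64` for the start time are all assumptions made for the example.

```rust
/// Datum types known to this sketch; the real set is larger.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum DatumType {
    Bool,
    U8,
    F64,
    HistogramF64,
}

/// Records a sample that was expected but could not be produced.
#[derive(Clone, Debug, PartialEq)]
pub struct MissingDatum {
    pub datum_type: DatumType,
    /// Optional start time (seconds since the epoch, in this sketch only).
    pub start_time: Option<u64>,
}

#[derive(Clone, Debug, PartialEq)]
pub enum Datum {
    Bool(bool),
    U8(u8),
    F64(f64),
    HistogramF64 { bins: Vec<f64>, counts: Vec<u64> },
    Missing(MissingDatum),
}

/// Hypothetical helper trait. A NULL carries no type information, so the
/// Rust type being deserialized supplies the intended datum type.
pub trait FromNullable: Sized {
    const DATUM_TYPE: DatumType;

    fn into_datum(self) -> Datum;

    /// `None` models a NULL read from a `Nullable(T)` column.
    fn from_nullable(value: Option<Self>) -> Datum {
        match value {
            Some(v) => v.into_datum(),
            None => Datum::Missing(MissingDatum {
                datum_type: Self::DATUM_TYPE,
                start_time: None,
            }),
        }
    }
}

impl FromNullable for bool {
    const DATUM_TYPE: DatumType = DatumType::Bool;
    fn into_datum(self) -> Datum {
        Datum::Bool(self)
    }
}

impl FromNullable for u8 {
    const DATUM_TYPE: DatumType = DatumType::U8;
    fn into_datum(self) -> Datum {
        Datum::U8(self)
    }
}

impl FromNullable for f64 {
    const DATUM_TYPE: DatumType = DatumType::F64;
    fn into_datum(self) -> Datum {
        Datum::F64(self)
    }
}

/// Arrays cannot be wrapped in Nullable, so empty arrays act as the
/// sentinel for a missing histogram.
pub fn histogram_from_arrays(bins: Vec<f64>, counts: Vec<u64>) -> Datum {
    if bins.is_empty() && counts.is_empty() {
        Datum::Missing(MissingDatum {
            datum_type: DatumType::HistogramF64,
            start_time: None,
        })
    } else {
        Datum::HistogramF64 { bins, counts }
    }
}
```

With this shape, `u8::from_nullable(None)` yields a `Datum::Missing` tagged as `U8`, while `histogram_from_arrays(vec![], vec![])` yields one tagged as `HistogramF64`. The sentinel only works as long as a real histogram can never have zero bins, which is the invariant the new test guards.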

@bnaecker
Collaborator Author

Fixes #4311

@bnaecker bnaecker force-pushed the support-oximeter-missing-samples branch 2 times, most recently from d5095fd to e1f41f3 Compare November 29, 2023 19:14
@bnaecker
Collaborator Author

This is a pretty substantive change, so I'm including some manual testing notes in addition to the new tests I added as part of the PR.

The big change here is that we've made most columns nullable in the ClickHouse tables that store individual measurements. The point is to store NULL in those records where we expected a measurement, but one could not be produced for some reason. I.e., this represents a missing sample.

Testing overview

Here's the general test flow:

  1. Launch Omicron from main
  2. Rebuild just the oximeter components from this PR branch
  3. Install those into the oximeter zone
  4. Upgrade the database to the new version, which makes the columns nullable.
  5. Make sure things are still working

Starting from main

I first built and launched the control plane from main (bb7ee84) on my home Helios machine. All the zones came up normally:

bnaecker@shale : ~/omicron $ zoneadm list
global
sidecar_softnpu
oxz_switch
oxz_internal_dns_5be28091-5094-46e1-b409-eeb732838df4
oxz_internal_dns_332bdd3f-9820-4ed2-ad9c-7ea7ea3a62cb
oxz_internal_dns_0168bba0-ec6c-4616-9ed9-a80512b8d867
oxz_ntp_91bd6362-c509-45f5-8802-e78a4ae06979
oxz_cockroachdb_ba599cfe-448f-4474-8ce6-9034eb30d9bc
oxz_cockroachdb_cb288fab-e100-4f65-bf43-9a93801ba79e
oxz_cockroachdb_9f668775-d633-409e-a081-5d8eb9c13f91
oxz_cockroachdb_3c14ecf5-a43b-436d-9501-c3a0543962a2
oxz_cockroachdb_13220500-7085-45ec-80bd-60c9d7fd3144
oxz_external_dns_df0db58b-542d-417a-ada1-56e8c8298db2
oxz_crucible_488e5492-e373-4b0c-b5d6-9ebfaf55b37a
oxz_crucible_8bbf062e-9178-4c38-9225-6747421c8fde
oxz_nexus_2525f7ae-12f2-40f1-8153-26233072120b
oxz_external_dns_1f93a20c-87c8-4518-b03c-e46888936df6
oxz_nexus_0d1fb90a-56f3-4b7b-9b3e-88448ea90294
oxz_crucible_eb3cd716-f2f9-4a4b-904d-7d38eb1c95ce
oxz_crucible_pantry_07bbc083-d600-43ed-8a3a-038728f99601
oxz_crucible_27675c89-d960-4940-af20-3fe7a40d9513
oxz_crucible_pantry_29b74054-231a-4ce3-9f74-6ee0ea9d8259
oxz_crucible_84e1a941-5a99-4bc4-935d-517fd75c4abe
oxz_crucible_e5af8133-7e8b-446a-b866-8342e4ede849
oxz_crucible_fba0d822-2ed4-4e8d-9673-5acf12311cf2
oxz_nexus_bd38ac04-7929-4f98-87c3-5ecfd10bf340
oxz_clickhouse_43dc9f8b-e50b-4f02-a54a-8e5a115a0c73
oxz_oximeter_975dc124-69d1-40f8-939d-2065b9d63944
oxz_crucible_pantry_722d7060-8a00-4675-948b-9f956157e872
oxz_crucible_e0b2015a-c457-4419-8209-03f56de265cd
oxz_crucible_99721c20-bfd5-4f30-a4e3-9b4aae952a1a

Here is the current set of tables, including the database version and a create-table statement:

bnaecker@shale : ~/omicron $ pfexec zlogin oxz_clickhouse_43dc9f8b-e50b-4f02-a54a-8e5a115a0c73
[Connected to zone 'oxz_clickhouse_43dc9f8b-e50b-4f02-a54a-8e5a115a0c73' pts/1]
Last login: Wed Nov 29 19:53:49 on pts/1
The illumos Project     helios-2.0.22248        October 2023
root@oxz_clickhouse_43dc9f8b-e50b-4f02-a54a-8e5a115a0c73:~# ipadm
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
lo0/v6            static   ok           ::1/128
oxControlService23/ll addrconf ok       fe80::8:20ff:fe22:2a48%oxControlService23/10
oxControlService23/omicron6 static ok   fd00:1122:3344:101::e/64
root@oxz_clickhouse_43dc9f8b-e50b-4f02-a54a-8e5a115a0c73:~# /opt/oxide/clickhouse/clickhouse client --database oximeter --host fd00:1122:3344:101::e
ClickHouse client version 22.8.9.1.
Connecting to database oximeter at fd00:1122:3344:101::e:9000 as user default.
Connected to ClickHouse server version 22.8.9 revision 54460.

oxz_clickhouse_43dc9f8b-e50b-4f02-a54a-8e5a115a0c73 :) show tables;

SHOW TABLES

Query id: 56b07bd8-f71c-412f-a29f-0ae33d8fe732

┌─name───────────────────────┐
│ fields_bool                │
│ fields_i16                 │
│ fields_i32                 │
│ fields_i64                 │
│ fields_i8                  │
│ fields_ipaddr              │
│ fields_string              │
│ fields_u16                 │
│ fields_u32                 │
│ fields_u64                 │
│ fields_u8                  │
│ fields_uuid                │
│ measurements_bool          │
│ measurements_bytes         │
│ measurements_cumulativef32 │
│ measurements_cumulativef64 │
│ measurements_cumulativei64 │
│ measurements_cumulativeu64 │
│ measurements_f32           │
│ measurements_f64           │
│ measurements_histogramf32  │
│ measurements_histogramf64  │
│ measurements_histogrami16  │
│ measurements_histogrami32  │
│ measurements_histogrami64  │
│ measurements_histogrami8   │
│ measurements_histogramu16  │
│ measurements_histogramu32  │
│ measurements_histogramu64  │
│ measurements_histogramu8   │
│ measurements_i16           │
│ measurements_i32           │
│ measurements_i64           │
│ measurements_i8            │
│ measurements_string        │
│ measurements_u16           │
│ measurements_u32           │
│ measurements_u64           │
│ measurements_u8            │
│ timeseries_schema          │
│ version                    │
└────────────────────────────┘

41 rows in set. Elapsed: 0.009 sec.

oxz_clickhouse_43dc9f8b-e50b-4f02-a54a-8e5a115a0c73 :) select * from version;

SELECT *
FROM version

Query id: 38f1a659-015c-4e89-9e4d-73549a2e46cb

┌─value─┬─────────────────────timestamp─┐
│     3 │ 2023-11-29 19:48:04.000000000 │
└───────┴───────────────────────────────┘

1 row in set. Elapsed: 0.002 sec.

oxz_clickhouse_43dc9f8b-e50b-4f02-a54a-8e5a115a0c73 :) show create table measurements_bool;

SHOW CREATE TABLE measurements_bool

Query id: 83d4a54e-95b5-4c84-b2b2-30cc5322d9d0

┌─statement──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ CREATE TABLE oximeter.measurements_bool
(
    `timeseries_name` String,
    `timeseries_key` UInt64,
    `timestamp` DateTime64(9, 'UTC'),
    `datum` UInt8
)
ENGINE = MergeTree
ORDER BY (timeseries_name, timeseries_key, timestamp)
TTL toDateTime(timestamp) + toIntervalDay(30)
SETTINGS index_granularity = 8192 │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

1 row in set. Elapsed: 0.001 sec.

oxz_clickhouse_43dc9f8b-e50b-4f02-a54a-8e5a115a0c73 :)

Install the new oximeter zone

I rebuilt everything from this PR branch, and manually copied in the new SQL upgrade files and oximeter collector into the zone:

bnaecker@shale : ~/omicron $ zonecfg -z oxz_oximeter_975dc124-69d1-40f8-939d-2065b9d63944 info
zonename: oxz_oximeter_975dc124-69d1-40f8-939d-2065b9d63944
zonepath: /pool/ext/24b4dc87-ab46-49fb-a4b4-d361ae214c03/crypt/zone/oxz_oximeter_975dc124-69d1-40f8-939d-2065b9d63944
brand: omicron1
autoboot: false
bootargs:
pool:
limitpriv:
scheduling-class:
ip-type: exclusive
hostid:
fs-allowed:
net:
	address not specified
	allowed-address not specified
	physical: oxControlService22
	defrouter not specified
bnaecker@shale : ~/omicron $ pfexec cp -r ./oximeter/db/schema/single-node/4 /pool/ext/24b4dc87-ab46-49fb-a4b4-d361ae214c03/crypt/zone/oxz_oximeter_975dc124-69d1-40f8-939d-2065b9d63944/root/opt/oxide/oximeter/schema/single-node/
bnaecker@shale : ~/omicron $ pfexec cp ./oximeter/db/schema/single-node/db-init.sql /pool/ext/24b4dc87-ab46-49fb-a4b4-d361ae214c03/crypt/
debug/ zone/
bnaecker@shale : ~/omicron $ pfexec cp ./oximeter/db/schema/single-node/db-init.sql /pool/ext/24b4dc87-ab46-49fb-a4b4-d361ae214c03/crypt/zone/oxz_oximeter_975dc124-69d1-40f8-939d-2065b9d63944/root/opt/oxide/oximeter/schema/single-node/
bnaecker@shale : ~/omicron $ pfexec zlogin oxz_oximeter_975dc124-69d1-40f8-939d-2065b9d63944 svcadm disable oximeter
bnaecker@shale : ~/omicron $ pfexec cp ./target/release/oximeter /pool/ext/24b4dc87-ab46-49fb-a4b4-d361ae214c03/crypt/zone/oxz_oximeter_975dc124-69d1-40f8-939d-2065b9d63944/root/opt/oxide/oximeter/bin/oximeter
bnaecker@shale : ~/omicron $ pfexec zlogin oxz_oximeter_975dc124-69d1-40f8-939d-2065b9d63944 svcadm enable oximeter
bnaecker@shale : ~/omicron $ pfexec zlogin oxz_oximeter_975dc124-69d1-40f8-939d-2065b9d63944
[Connected to zone 'oxz_oximeter_975dc124-69d1-40f8-939d-2065b9d63944' pts/1]
Last login: Wed Nov 29 20:03:28 on pts/1
The illumos Project	helios-2.0.22248	October 2023
root@oxz_oximeter_975dc124-69d1-40f8-939d-2065b9d63944:~# tail $(svcs -L oximeter)
{"msg":"failed to create ClickHouse client","v":0,"name":"oximeter","level":40,"time":"2023-11-29T20:08:47.729984033Z","hostname":"oxz_oximeter_975dc124-69d1-40f8-939d-2065b9d63944","pid":27232,"file":"oximeter/collector/src/lib.rs:213","error":"Database(DatabaseVersionMismatch { expected: 4, found: 3 })","retry_after":"1.484488595s"}
{"msg":"creating ClickHouse client","v":0,"name":"oximeter","level":20,"time":"2023-11-29T20:08:49.215077417Z","hostname":"oxz_oximeter_975dc124-69d1-40f8-939d-2065b9d63944","pid":27232}
{"msg":"lookup srv","v":0,"name":"oximeter","level":20,"time":"2023-11-29T20:08:49.215122099Z","hostname":"oxz_oximeter_975dc124-69d1-40f8-939d-2065b9d63944","pid":27232,"component":"DnsResolver","dns_name":"_clickhouse._tcp.control-plane.oxide.internal"}
{"msg":"failed to create ClickHouse client","v":0,"name":"oximeter","level":40,"time":"2023-11-29T20:08:49.246122072Z","hostname":"oxz_oximeter_975dc124-69d1-40f8-939d-2065b9d63944","pid":27232,"file":"oximeter/collector/src/lib.rs:213","error":"Database(DatabaseVersionMismatch { expected: 4, found: 3 })","retry_after":"1.092047245s"}
{"msg":"creating ClickHouse client","v":0,"name":"oximeter","level":20,"time":"2023-11-29T20:08:50.33920794Z","hostname":"oxz_oximeter_975dc124-69d1-40f8-939d-2065b9d63944","pid":27232}
{"msg":"lookup srv","v":0,"name":"oximeter","level":20,"time":"2023-11-29T20:08:50.339253021Z","hostname":"oxz_oximeter_975dc124-69d1-40f8-939d-2065b9d63944","pid":27232,"component":"DnsResolver","dns_name":"_clickhouse._tcp.control-plane.oxide.internal"}
{"msg":"failed to create ClickHouse client","v":0,"name":"oximeter","level":40,"time":"2023-11-29T20:08:50.369147992Z","hostname":"oxz_oximeter_975dc124-69d1-40f8-939d-2065b9d63944","pid":27232,"file":"oximeter/collector/src/lib.rs:213","error":"Database(DatabaseVersionMismatch { expected: 4, found: 3 })","retry_after":"3.867146995s"}
{"msg":"creating ClickHouse client","v":0,"name":"oximeter","level":20,"time":"2023-11-29T20:08:54.23725545Z","hostname":"oxz_oximeter_975dc124-69d1-40f8-939d-2065b9d63944","pid":27232}
{"msg":"lookup srv","v":0,"name":"oximeter","level":20,"time":"2023-11-29T20:08:54.237297971Z","hostname":"oxz_oximeter_975dc124-69d1-40f8-939d-2065b9d63944","pid":27232,"component":"DnsResolver","dns_name":"_clickhouse._tcp.control-plane.oxide.internal"}
{"msg":"failed to create ClickHouse client","v":0,"name":"oximeter","level":40,"time":"2023-11-29T20:08:54.278363661Z","hostname":"oxz_oximeter_975dc124-69d1-40f8-939d-2065b9d63944","pid":27232,"file":"oximeter/collector/src/lib.rs:213","error":"Database(DatabaseVersionMismatch { expected: 4, found: 3 })","retry_after":"8.279137857s"}
root@oxz_oximeter_975dc124-69d1-40f8-939d-2065b9d63944:~#
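The behavior visible in the log above (compare the expected version against what the database reports, and retry on a mismatch) can be sketched roughly as follows. This is only an illustration: the error variant mirrors the name printed in the log, but `check_version`, `wait_for_version`, and the attempt limit are hypothetical, and the delay between attempts (the `retry_after` field in the log) is omitted.

```rust
#[derive(Debug, PartialEq)]
pub enum Error {
    /// Mirrors the error name seen in the log output; fields are assumed.
    DatabaseVersionMismatch { expected: u64, found: u64 },
}

/// Succeeds only when the database reports the version the collector expects.
pub fn check_version(expected: u64, found: u64) -> Result<(), Error> {
    if expected == found {
        Ok(())
    } else {
        Err(Error::DatabaseVersionMismatch { expected, found })
    }
}

/// Polls `read_version` until the versions match, making at least one and at
/// most `max_attempts` attempts. Returns the number of attempts used.
pub fn wait_for_version(
    expected: u64,
    max_attempts: usize,
    mut read_version: impl FnMut() -> u64,
) -> Result<usize, Error> {
    let mut attempt = 1;
    loop {
        match check_version(expected, read_version()) {
            Ok(()) => return Ok(attempt),
            Err(e) if attempt >= max_attempts => return Err(e),
            // A real collector would sleep for a backoff interval here.
            Err(_) => attempt += 1,
        }
    }
}
```

In this sketch, a database that reports 3, 3, and then 4 lets `wait_for_version(4, 5, ...)` succeed on the third attempt, which matches the sequence in this test: the collector keeps failing until the schema updater brings the database to version 4.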

After re-enabling the oximeter service, you can see that it's hanging around waiting for the database to be updated, which I did next:

root@oxz_oximeter_975dc124-69d1-40f8-939d-2065b9d63944:~# /opt/oxide/oximeter/bin/clickhouse-schema-updater --host [fd00:1122:3344:101::e]:8123 ls
Latest version: 3
Available versions:
 2
 3 (reported by database) (expected by oximeter)
 4
root@oxz_oximeter_975dc124-69d1-40f8-939d-2065b9d63944:~# /opt/oxide/oximeter/bin/clickhouse-schema-updater --host [fd00:1122:3344:101::e]:8123 up 4
Upgrade to oximeter database version 4 complete
root@oxz_oximeter_975dc124-69d1-40f8-939d-2065b9d63944:~# tail $(svcs -L oximeter)
{"msg":"request completed","v":0,"name":"oximeter","level":30,"time":"2023-11-29T20:10:02.856400617Z","hostname":"oxz_oximeter_975dc124-69d1-40f8-939d-2065b9d63944","pid":27232,"uri":"/producers","method":"POST","req_id":"b03c9b7f-fdfd-4b8f-bfba-97e1fc8fd7ec","remote_addr":"[fd00:1122:3344:101::c]:54168","local_addr":"[fd00:1122:3344:101::d]:12223","component":"dropshot","file":"/home/bnaecker/.cargo/git/checkouts/dropshot-a4a923d29dccc492/ff87a01/dropshot/src/server.rs:853","latency_us":78,"response_code":"204"}
{"msg":"registered new metric producer","v":0,"name":"oximeter","level":20,"time":"2023-11-29T20:10:02.856610103Z","hostname":"oxz_oximeter_975dc124-69d1-40f8-939d-2065b9d63944","pid":27232,"collector_id":"975dc124-69d1-40f8-939d-2065b9d63944","component":"oximeter-agent","address":"[fd00:1122:3344:101::1]:12345","producer_id":"495fe58c-48bc-4eea-abab-bfa47a54ca49"}
{"msg":"request completed","v":0,"name":"oximeter","level":30,"time":"2023-11-29T20:10:02.856632394Z","hostname":"oxz_oximeter_975dc124-69d1-40f8-939d-2065b9d63944","pid":27232,"uri":"/producers","method":"POST","req_id":"1ebcd252-98b5-4115-8638-8ad6730fd81d","remote_addr":"[fd00:1122:3344:101::c]:54168","local_addr":"[fd00:1122:3344:101::d]:12223","component":"dropshot","file":"/home/bnaecker/.cargo/git/checkouts/dropshot-a4a923d29dccc492/ff87a01/dropshot/src/server.rs:853","latency_us":57,"response_code":"204"}
{"msg":"registered new metric producer","v":0,"name":"oximeter","level":20,"time":"2023-11-29T20:10:02.856783248Z","hostname":"oxz_oximeter_975dc124-69d1-40f8-939d-2065b9d63944","pid":27232,"collector_id":"975dc124-69d1-40f8-939d-2065b9d63944","component":"oximeter-agent","address":"[fd00:1122:3344:101::c]:12221","producer_id":"bd38ac04-7929-4f98-87c3-5ecfd10bf340"}
{"msg":"request completed","v":0,"name":"oximeter","level":30,"time":"2023-11-29T20:10:02.856806748Z","hostname":"oxz_oximeter_975dc124-69d1-40f8-939d-2065b9d63944","pid":27232,"uri":"/producers","method":"POST","req_id":"50f372b6-f60a-4adf-9ec8-288ff9fc9830","remote_addr":"[fd00:1122:3344:101::c]:54168","local_addr":"[fd00:1122:3344:101::d]:12223","component":"dropshot","file":"/home/bnaecker/.cargo/git/checkouts/dropshot-a4a923d29dccc492/ff87a01/dropshot/src/server.rs:853","latency_us":60,"response_code":"204"}
{"msg":"oximeter registered with nexus","v":0,"name":"oximeter","level":30,"time":"2023-11-29T20:10:02.860108293Z","hostname":"oxz_oximeter_975dc124-69d1-40f8-939d-2065b9d63944","pid":27232,"file":"oximeter/collector/src/lib.rs:285","id":"975dc124-69d1-40f8-939d-2065b9d63944"}
{"msg":"starting oximeter collection task","v":0,"name":"oximeter","level":20,"time":"2023-11-29T20:10:02.931572768Z","hostname":"oxz_oximeter_975dc124-69d1-40f8-939d-2065b9d63944","pid":27232,"producer_id":"495fe58c-48bc-4eea-abab-bfa47a54ca49","component":"collection-task","collector_id":"975dc124-69d1-40f8-939d-2065b9d63944","component":"oximeter-agent","interval":"30s"}
{"msg":"starting oximeter collection task","v":0,"name":"oximeter","level":20,"time":"2023-11-29T20:10:02.932636739Z","hostname":"oxz_oximeter_975dc124-69d1-40f8-939d-2065b9d63944","pid":27232,"producer_id":"bd38ac04-7929-4f98-87c3-5ecfd10bf340","component":"collection-task","collector_id":"975dc124-69d1-40f8-939d-2065b9d63944","component":"oximeter-agent","interval":"10s"}
{"msg":"starting oximeter collection task","v":0,"name":"oximeter","level":20,"time":"2023-11-29T20:10:02.937715693Z","hostname":"oxz_oximeter_975dc124-69d1-40f8-939d-2065b9d63944","pid":27232,"producer_id":"2525f7ae-12f2-40f1-8153-26233072120b","component":"collection-task","collector_id":"975dc124-69d1-40f8-939d-2065b9d63944","component":"oximeter-agent","interval":"10s"}
{"msg":"starting oximeter collection task","v":0,"name":"oximeter","level":20,"time":"2023-11-29T20:10:02.938738612Z","hostname":"oxz_oximeter_975dc124-69d1-40f8-939d-2065b9d63944","pid":27232,"producer_id":"0d1fb90a-56f3-4b7b-9b3e-88448ea90294","component":"collection-task","collector_id":"975dc124-69d1-40f8-939d-2065b9d63944","component":"oximeter-agent","interval":"10s"}

We can see that oximeter has moved past the point where it's waiting for the database version to match, and has started some of its collection tasks. The database has also been updated successfully, since it now reports the correct version (4), and the datum columns are now Nullable(T) rather than just T.

bnaecker@shale : ~/omicron $ pfexec zlogin oxz_clickhouse_43dc9f8b-e50b-4f02-a54a-8e5a115a0c73
[Connected to zone 'oxz_clickhouse_43dc9f8b-e50b-4f02-a54a-8e5a115a0c73' pts/8]
Last login: Wed Nov 29 20:10:28 on pts/8
The illumos Project     helios-2.0.22248        October 2023
root@oxz_clickhouse_43dc9f8b-e50b-4f02-a54a-8e5a115a0c73:~# /opt/oxide/clickhouse/clickhouse client --database oximeter --host fd00:1122:3344:101::e
ClickHouse client version 22.8.9.1.
Connecting to database oximeter at fd00:1122:3344:101::e:9000 as user default.
Connected to ClickHouse server version 22.8.9 revision 54460.

oxz_clickhouse_43dc9f8b-e50b-4f02-a54a-8e5a115a0c73 :) select * from version;

SELECT *
FROM version

Query id: d4063927-bc91-40e0-bd13-89a26709b432

┌─value─┬─────────────────────timestamp─┐
│     4 │ 2023-11-29 20:09:56.000000000 │
└───────┴───────────────────────────────┘
┌─value─┬─────────────────────timestamp─┐
│     3 │ 2023-11-29 19:48:04.000000000 │
└───────┴───────────────────────────────┘

2 rows in set. Elapsed: 0.003 sec.

oxz_clickhouse_43dc9f8b-e50b-4f02-a54a-8e5a115a0c73 :) show create table measurements_u8;

SHOW CREATE TABLE measurements_u8

Query id: fb2c1fdc-8499-4dfb-a157-ba476ca57816

┌─statement──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ CREATE TABLE oximeter.measurements_u8
(
    `timeseries_name` String,
    `timeseries_key` UInt64,
    `timestamp` DateTime64(9, 'UTC'),
    `datum` Nullable(UInt8)
)
ENGINE = MergeTree
ORDER BY (timeseries_name, timeseries_key, timestamp)
TTL toDateTime(timestamp) + toIntervalDay(30)
SETTINGS index_granularity = 8192 │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

1 row in set. Elapsed: 0.001 sec.

oximeter is happily collecting measurements and inserting them into the right table as before.

@bnaecker bnaecker force-pushed the support-oximeter-missing-samples branch from e1f41f3 to 8b5abc4 Compare November 29, 2023 20:39
@bnaecker bnaecker requested review from ahl and smklein November 29, 2023 21:54
Collaborator

@smklein smklein left a comment

Looks good, only a few minor cleanup comments.

I can save this question for #4311 , but what's your idea on how this gets presented to end-users?

Put another way, if someone queries for samples between times A and B, and they see:

  • "No samples"
    vs
  • "We have five samples of missing data"

What's that supposed to mean?

I know you discussed the possibility of a "measurement failure" but I'm not really wrapping my head around the end-to-end distinction between these two cases. Is this something metric producers will need to start doing implicitly? (e.g., "no samples" actually means that "missing data" is recorded?)

@bnaecker
Collaborator Author

bnaecker commented Dec 4, 2023

> what's your idea on how this gets presented to end-users?
>
> Put another way, if someone queries for samples between times A and B, and they see:
>
> * "No samples"
>   vs
> * "We have five samples of missing data"
>
> What's that supposed to mean?

To be transparent, I'm not yet sure how this will be communicated to customers. It actually may not be in the end, or maybe only to operators. But I do think it's important that we internally have a record of a missing sample. Right now, producers can generate an error, which is only logged in the oximeter collector output, but never persisted. This at least lets us internally understand how our systems are working, even if the missing records are filtered out in the console or the querying API or through some other mechanism.

UPDATE: I didn't quite answer your question :) The difference between "No samples" and "5 missing samples" is that the latter says, "Hmm, something isn't working correctly because it couldn't collect the data it expected". On a graph, this would probably be "shown" as NaNs, which is to say "no value on the y-axis, but possibly shown in some other way". For example, this shows 4 valid data points at x = (0, 1, 3, 4), and one missing sample at x == 2.

[Image: missing]

That's one possible way someone could see that there is missing data, rather than a different scenario such as when the system didn't even attempt to produce samples.
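To make the NaN idea concrete, here is a small sketch of how a query layer might turn a series containing missing samples into plot values. Nothing here is from the PR: the function names are hypothetical, and a missing sample is modeled simply as `None`.

```rust
/// Maps each sample to a y-value, using NaN for missing samples so that a
/// plotting library leaves a gap instead of drawing a point.
pub fn to_plot_values(samples: &[Option<f64>]) -> Vec<f64> {
    samples.iter().map(|s| s.unwrap_or(f64::NAN)).collect()
}

/// Returns the positions of missing samples, in case a UI wants to mark
/// them in some other way (for example, a shaded band on the x-axis).
pub fn missing_indices(samples: &[Option<f64>]) -> Vec<usize> {
    samples
        .iter()
        .enumerate()
        .filter(|(_, s)| s.is_none())
        .map(|(i, _)| i)
        .collect()
}
```

For a series with valid points at x = 0, 1, 3, 4 and a missing sample at x == 2, `missing_indices` returns `[2]` and the third plot value is NaN. A series for which no collection was attempted would simply be shorter, which is the distinction being drawn here.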

> I know you discussed the possibility of a "measurement failure" but I'm not really wrapping my head around the end-to-end distinction between these two cases. Is this something metric producers will need to start doing implicitly? (e.g., "no samples" actually means that "missing data" is recorded?)

The goal was to be less implicit, not more so. Right now, if there is a failure to collect a sample, producers can send a MetricsError to oximeter. That is logged on the collection side, but nothing else is done. I am hoping that instead, producers will opt to generate an actual MetricsError much less frequently -- such as when they encounter the equivalent of a 500, or a violation of some invariant. I would hope that to the extent possible, we produce missing samples instead, since those provide nearly as much data, but are also available outside the producer and collector processes through the database.

Hope that helps provide some more context! Let me know your thoughts about all that.

@smklein
Collaborator

smklein commented Dec 4, 2023

> I am hoping that instead, producers will opt to generate an actual MetricsError much less frequently -- such as when they encounter the equivalent of a 500, or a violation of some invariant. I would hope that to the extent possible, we produce missing samples instead, since those provide nearly as much data, but are also available outside the producer and collector processes through the database.

Got it, this makes sense, and answers a big part of my question: right now, if we merge this with no other changes, we wouldn't see any missing samples. But the next step would be finding out how producers could/should emit this information.

Notably, this means that we still need connectivity between collector -> producer in order to log these metrics, correct? If a producer was running within a service on a sled that goes offline, it wouldn't actually produce any "missing" samples, right?

Collaborator

@smklein smklein left a comment

Mechanically, I'm happy with this PR, and I'm okay merging this as-is to get the schema changes in-place.

Logistically, I still have some questions about:

  1. Best practices for emitting missing metrics (always producer-triggered? What if the producer's service crashes? It might be worthwhile discussing a collector-triggered missing metric, if the producer is not responding to us)
  2. How we're displaying this in the UI -- though your sample graph convinced me this is a tractable problem!

@bnaecker
Collaborator Author

bnaecker commented Dec 4, 2023

> Notably, this means that we still need connectivity between collector -> producer in order to log these metrics, correct? If a producer was running within a service on a sled that goes offline, it wouldn't actually produce any "missing" samples, right?

Correct. A missing sample still needs to be produced; oximeter won't ever cons them up for a producer. Or as you said, nothing at all will change today with this merge, since nothing produces missing samples yet.

> What if the producer's service crashes? It might be worthwhile discussing a collector-triggered missing metric, if the producer is not responding to us

If oximeter itself fails to reach a producer, that's incorporated into the self-statistics that oximeter keeps, under the oximeter_collector:failed_collections timeseries. That is broken down by a few fields that should let us figure out exactly which service or process oximeter couldn't get a hold of.

> Best practices for emitting missing metrics

That is a good idea. I think "best effort" is the most accurate suggestion I have right now. Basically, in priority order I would suggest:

  1. Produce a non-missing sample
  2. If that fails, and it's possible, generate a missing sample
  3. Send a MetricsError
  4. Log an error yourself
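The priority order above could look something like the following on the producer side. This is a sketch under assumptions, not an API from the PR: `Reading`, `Failure`, and `produce` are hypothetical names, and `MetricsError` here is a stand-in struct rather than the real error type.

```rust
/// Stand-in for the real `MetricsError`; the actual type differs.
#[derive(Debug, PartialEq)]
pub struct MetricsError(pub String);

/// What the producer hands to the collector for one measurement.
#[derive(Debug, PartialEq)]
pub enum Reading {
    Value(u64),
    /// The sample was expected but could not be produced.
    Missing,
}

/// Hypothetical classification of why a measurement attempt failed.
pub enum Failure {
    /// The value is unavailable, but the producer is otherwise healthy.
    Unavailable,
    /// The equivalent of a 500, or a violated invariant.
    Broken(String),
}

/// Applies the priority order: a real sample first, then a missing sample,
/// and a `MetricsError` only for serious failures.
pub fn produce(attempt: Result<u64, Failure>) -> Result<Reading, MetricsError> {
    match attempt {
        Ok(value) => Ok(Reading::Value(value)),
        Err(Failure::Unavailable) => Ok(Reading::Missing),
        Err(Failure::Broken(why)) => Err(MetricsError(why)),
    }
}
```

Step 4 (logging an error yourself) is not modeled. It would apply when even the `MetricsError` cannot be delivered, which is outside what a pure function like `produce` can express.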

@bnaecker bnaecker merged commit 9b666e7 into main Dec 4, 2023
22 checks passed
@bnaecker bnaecker deleted the support-oximeter-missing-samples branch December 4, 2023 21:42