
Stop collecting Propolis metrics on instance stop #4495

Merged
merged 2 commits into from
Nov 14, 2023


bnaecker
Collaborator

When Nexus responds to a sled-agent notification that the instance is stopped and its Propolis server is gone, hard-delete the assignment record and ask oximeter to stop collecting from it.
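The flow described above can be modelled in a few lines. This is only a toy in-memory sketch, not the Nexus code (which is Rust and lives in `nexus/src/app/instance.rs`): `FakeCollector`, `on_instance_stopped`, and the dictionary standing in for the assignment table are hypothetical names, and doing the delete before the collector call simply mirrors the order of the sentence above rather than any confirmed implementation detail.

```python
class FakeCollector:
    """Hypothetical stand-in for an oximeter collector."""

    def __init__(self):
        # Producer IDs this collector is currently polling.
        self.tasks = set()

    def register(self, producer_id):
        self.tasks.add(producer_id)

    def unregister(self, producer_id):
        # discard() is a no-op if the producer is unknown, so repeated
        # stop notifications are harmless.
        self.tasks.discard(producer_id)


def on_instance_stopped(assignments, collectors, instance_id):
    """Drop the assignment for a stopped instance and stop collection.

    `assignments` maps producer ID -> collector ID (a stand-in for the
    metric_producer table). Returns True if an assignment was removed.
    """
    # Hard delete: the record is removed outright, not marked as deleted.
    collector_id = assignments.pop(instance_id, None)
    if collector_id is None:
        return False
    collectors[collector_id].unregister(instance_id)
    return True


# IDs taken from the test run described in this PR.
INSTANCE_ID = "8c0ff8aa-ccf3-4242-bfef-43dc8541d5c5"
COLLECTOR_ID = "8bd82845-d32a-4c7e-a00f-7b16915e8e66"

collectors = {COLLECTOR_ID: FakeCollector()}
collectors[COLLECTOR_ID].register(INSTANCE_ID)
assignments = {INSTANCE_ID: COLLECTOR_ID}

print(on_instance_stopped(assignments, collectors, INSTANCE_ID))
print(assignments, collectors[COLLECTOR_ID].tasks)
```

Because Propolis registers with the instance ID as its producer ID, the instance ID alone is enough to find both the assignment and the collector that holds it.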

@bnaecker bnaecker requested a review from gjcolombo November 14, 2023 18:11
@bnaecker
Collaborator Author

This is the first commit in what I expect to be a series that makes metric producer/collector assignments more robust and flexible. It doesn't resolve #3808, but it keeps the problem from getting worse by asking oximeter to stop collecting from Propolis servers when they are destroyed.

To test this, I built and installed the control plane on my Helios machine, and created an instance in the console. That instance has ID 8c0ff8aa-ccf3-4242-bfef-43dc8541d5c5. Here is the state of the omicron.metric_producer table right before and after that instance creation:

root@[fd00:1122:3344:101::6]:32221/omicron> select * from metric_producer;
                   id                  |         time_created          |         time_modified         |          ip           | port  | interval |    base_route    |             oximeter_id
---------------------------------------+-------------------------------+-------------------------------+-----------------------+-------+----------+------------------+---------------------------------------
  271c2bb7-6ab8-497f-ba84-5b44f497bcd4 | 2023-11-13 23:45:44.595891+00 | 2023-11-13 23:45:44.595891+00 | fd00:1122:3344:101::a | 12221 |       10 | /metrics/collect | 8bd82845-d32a-4c7e-a00f-7b16915e8e66
  3d53afcf-a2ea-4bde-be0b-ecce97107416 | 2023-11-13 23:45:59.76313+00  | 2023-11-13 23:45:59.76313+00  | fd00:1122:3344:101::b | 12221 |       10 | /metrics/collect | 8bd82845-d32a-4c7e-a00f-7b16915e8e66
  5304e4f4-f44a-4949-8443-d22329f277b6 | 2023-11-13 23:45:59.768562+00 | 2023-11-13 23:45:59.768562+00 | fd00:1122:3344:101::c | 12221 |       10 | /metrics/collect | 8bd82845-d32a-4c7e-a00f-7b16915e8e66
  7c8ed1d2-a49e-4a3a-8d3a-d2baae6c22f7 | 2023-11-13 23:41:47.819478+00 | 2023-11-13 23:41:47.819478+00 | fd00:1122:3344:101::1 | 12345 |       30 | /metrics/collect | 8bd82845-d32a-4c7e-a00f-7b16915e8e66
  a13c3009-2639-4b1c-9c68-88261f976cc7 | 2023-11-13 23:41:00.160852+00 | 2023-11-13 23:41:00.160852+00 | fd00:1122:3344:101::2 | 12224 |       10 | /collect/data    | 8bd82845-d32a-4c7e-a00f-7b16915e8e66
(5 rows)


Time: 2ms total (execution 1ms / network 0ms)

root@[fd00:1122:3344:101::6]:32221/omicron> select * from metric_producer;
                   id                  |         time_created          |         time_modified         |           ip           | port  | interval |    base_route    |             oximeter_id
---------------------------------------+-------------------------------+-------------------------------+------------------------+-------+----------+------------------+---------------------------------------
  271c2bb7-6ab8-497f-ba84-5b44f497bcd4 | 2023-11-13 23:45:44.595891+00 | 2023-11-13 23:45:44.595891+00 | fd00:1122:3344:101::a  | 12221 |       10 | /metrics/collect | 8bd82845-d32a-4c7e-a00f-7b16915e8e66
  3d53afcf-a2ea-4bde-be0b-ecce97107416 | 2023-11-13 23:45:59.76313+00  | 2023-11-13 23:45:59.76313+00  | fd00:1122:3344:101::b  | 12221 |       10 | /metrics/collect | 8bd82845-d32a-4c7e-a00f-7b16915e8e66
  5304e4f4-f44a-4949-8443-d22329f277b6 | 2023-11-13 23:45:59.768562+00 | 2023-11-13 23:45:59.768562+00 | fd00:1122:3344:101::c  | 12221 |       10 | /metrics/collect | 8bd82845-d32a-4c7e-a00f-7b16915e8e66
  7c8ed1d2-a49e-4a3a-8d3a-d2baae6c22f7 | 2023-11-13 23:41:47.819478+00 | 2023-11-13 23:41:47.819478+00 | fd00:1122:3344:101::1  | 12345 |       30 | /metrics/collect | 8bd82845-d32a-4c7e-a00f-7b16915e8e66
  8c0ff8aa-ccf3-4242-bfef-43dc8541d5c5 | 2023-11-14 18:05:05.275408+00 | 2023-11-14 18:05:05.275408+00 | fd00:1122:3344:101::22 | 33853 |       30 | /collect         | 8bd82845-d32a-4c7e-a00f-7b16915e8e66
  a13c3009-2639-4b1c-9c68-88261f976cc7 | 2023-11-13 23:41:00.160852+00 | 2023-11-13 23:41:00.160852+00 | fd00:1122:3344:101::2  | 12224 |       10 | /collect/data    | 8bd82845-d32a-4c7e-a00f-7b16915e8e66
(6 rows)


Time: 2ms total (execution 2ms / network 1ms)

root@[fd00:1122:3344:101::6]:32221/omicron>

And here is the oximeter log also showing that assignment:

{"msg":"accepted connection","v":0,"name":"oximeter","level":30,"time":"2023-11-14T18:05:05.285024965Z","hostname":"oxz_oximeter_8bd82845-d32a-4c7e-a00f-7b16915e8e66","pid":28466,"local_addr":"[fd00:1122:3344:101::d]:12223","component":"dropshot","file":"/home/bnaecker/.cargo/git/checkouts/dropshot-a4a923d29dccc492/fa728d0/dropshot/src/server.rs:769","remote_addr":"[fd00:1122:3344:101::c]:32986"}
{"msg":"registered new metric producer","v":0,"name":"oximeter","level":20,"time":"2023-11-14T18:05:05.285120856Z","hostname":"oxz_oximeter_8bd82845-d32a-4c7e-a00f-7b16915e8e66","pid":28466,"collector_id":"8bd82845-d32a-4c7e-a00f-7b16915e8e66","component":"oximeter-agent","address":"[fd00:1122:3344:101::22]:33853","producer_id":"8c0ff8aa-ccf3-4242-bfef-43dc8541d5c5"}
{"msg":"request completed","v":0,"name":"oximeter","level":30,"time":"2023-11-14T18:05:05.285172726Z","hostname":"oxz_oximeter_8bd82845-d32a-4c7e-a00f-7b16915e8e66","pid":28466,"uri":"/producers","method":"POST","req_id":"894699d3-6f44-4585-b17e-6afd4a273f4d","remote_addr":"[fd00:1122:3344:101::c]:32986","local_addr":"[fd00:1122:3344:101::d]:12223","component":"dropshot","file":"/home/bnaecker/.cargo/git/checkouts/dropshot-a4a923d29dccc492/fa728d0/dropshot/src/server.rs:853","latency_us":96,"response_code":"204"}
{"msg":"starting oximeter collection task","v":0,"name":"oximeter","level":20,"time":"2023-11-14T18:05:05.317104275Z","hostname":"oxz_oximeter_8bd82845-d32a-4c7e-a00f-7b16915e8e66","pid":28466,"producer_id":"8c0ff8aa-ccf3-4242-bfef-43dc8541d5c5","component":"collection-task","collector_id":"8bd82845-d32a-4c7e-a00f-7b16915e8e66","component":"oximeter-agent","interval":"30s"}

Propolis registers using the instance ID as its primary producer ID, so we can see that in the oximeter logs under the producer_id key.
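For anyone repeating this check, a small filter over the log makes the lifecycle of one producer easier to read. The following is a minimal sketch, assuming the log is available as one JSON object per line; `events_for_producer` is a hypothetical helper, and the sample records are abbreviated from the excerpts in this comment (most fields dropped). Lines that do not parse as a JSON object are skipped rather than treated as errors.

```python
import json


def events_for_producer(lines, producer_id):
    """Return (time, msg) pairs for records tagged with producer_id."""
    events = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Not a complete JSON object, e.g. a wrapped or truncated line.
            continue
        if isinstance(record, dict) and record.get("producer_id") == producer_id:
            events.append((record.get("time"), record.get("msg")))
    return events


# Abbreviated copies of the records shown in this comment.
SAMPLE = [
    '{"msg":"registered new metric producer","time":"2023-11-14T18:05:05.285120856Z","producer_id":"8c0ff8aa-ccf3-4242-bfef-43dc8541d5c5"}',
    '{"msg":"request completed","time":"2023-11-14T18:05:05.285172726Z","uri":"/producers","method":"POST"}',
    '{"msg":"starting oximeter collection task","time":"2023-11-14T18:05:05.317104275Z","producer_id":"8c0ff8aa-ccf3-4242-bfef-43dc8541d5c5"}',
]

INSTANCE_ID = "8c0ff8aa-ccf3-4242-bfef-43dc8541d5c5"

for time, msg in events_for_producer(SAMPLE, INSTANCE_ID):
    print(time, msg)
```

The `request completed` record carries no `producer_id` key for the `POST /producers` call, so it is filtered out; matching on the request URI would be needed to include it.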

Next, I stopped the instance in the console. The assignment record is now gone from the table, and the oximeter logs show the removal:

root@[fd00:1122:3344:101::6]:32221/omicron> select * from metric_producer where id = '8c0ff8aa-ccf3-4242-bfef-43dc8541d5c5';
  id | time_created | time_modified | ip | port | interval | base_route | oximeter_id
-----+--------------+---------------+----+------+----------+------------+--------------
(0 rows)


Time: 2ms total (execution 2ms / network 0ms)

And in oximeter:

{"msg":"accepted connection","v":0,"name":"oximeter","level":30,"time":"2023-11-14T18:06:36.919905984Z","hostname":"oxz_oximeter_8bd82845-d32a-4c7e-a00f-7b16915e8e66","pid":28466,"local_addr":"[fd00:1122:3344:101::d]:12223","component":"dropshot","file":"/home/bnaecker/.cargo/git/checkouts/dropshot-a4a923d29dccc492/fa728d0/dropshot/src/server.rs:769","remote_addr":"[fd00:1122:3344:101::a]:60886"}
{"msg":"removed collection task from set","v":0,"name":"oximeter","level":20,"time":"2023-11-14T18:06:36.9279791Z","hostname":"oxz_oximeter_8bd82845-d32a-4c7e-a00f-7b16915e8e66","pid":28466,"collector_id":"8bd82845-d32a-4c7e-a00f-7b16915e8e66","component":"oximeter-agent","producer_id":"8c0ff8aa-ccf3-4242-bfef-43dc8541d5c5"}
{"msg":"shut down collection task","v":0,"name":"oximeter","level":20,"time":"2023-11-14T18:06:36.92801427Z","hostname":"oxz_oximeter_8bd82845-d32a-4c7e-a00f-7b16915e8e66","pid":28466,"collector_id":"8bd82845-d32a-4c7e-a00f-7b16915e8e66","component":"oximeter-agent","producer_id":"8c0ff8aa-ccf3-4242-bfef-43dc8541d5c5"}
{"msg":"request completed","v":0,"name":"oximeter","level":30,"time":"2023-11-14T18:06:36.9280253Z","hostname":"oxz_oximeter_8bd82845-d32a-4c7e-a00f-7b16915e8e66","pid":28466,"uri":"/producers/8c0ff8aa-ccf3-4242-bfef-43dc8541d5c5","method":"DELETE","req_id":"e8d34dbe-350c-476a-b44e-1aa64637cf2b","remote_addr":"[fd00:1122:3344:101::a]:60886","local_addr":"[fd00:1122:3344:101::d]:12223","component":"dropshot","file":"/home/bnaecker/.cargo/git/checkouts/dropshot-a4a923d29dccc492/fa728d0/dropshot/src/server.rs:853","latency_us":8088,"response_code":"204"}

As a final test, I started the instance again, and things went back to the expected state: oximeter reported the new producer, and the metric assignment table showed the record again. Together, this should handle the happy path, when instances stop successfully, and prevent further ballooning of the oximeter heap and the assignment table.

In a follow-up PR, I'll include some work to remove existing assignments and update the table schema slightly to help keep track of producers during future automated updates.

nexus/src/app/instance.rs (review thread resolved)
@bnaecker
Collaborator Author

Thanks @gjcolombo, I've added an item about ensuring metric registrations follow the instance's state in #3742, and have added a note referring to that in b866993. Appreciate the thoughtful review, as always!

@bnaecker
Collaborator Author

This closes #3812

@bnaecker bnaecker merged commit a8a49c3 into main Nov 14, 2023
19 of 20 checks passed
@bnaecker bnaecker deleted the unregister-instance-metrics-on-stop branch November 14, 2023 22:25