An Oxide health check endpoint for customer monitoring system to use #3923

askfongjojo · 2023-08-21T18:05:33Z

We don't currently provide an API endpoint that a customer's monitoring system can use for general health checks. The API should provide:

* a 200 or 50x response
* an error message for a failed status check
* an optional configurable timeout

One thing we may want to consider is whether the API can be used with silo-specific endpoints, with the Recovery silo endpoint only, or with both. From a Nexus monitoring perspective, a check against any silo endpoint should suffice. There is also a chance that a particular silo endpoint doesn't work because of an external DNS issue (which is uncommon), so there may be some value in customers checking all silo endpoints.
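To make the "check all silo endpoints" idea concrete, here is a rough sketch of what a customer-side monitor might do: probe each silo URL with a configurable timeout and report DNS failures separately from HTTP errors. The function names (`probe`, `check_silos`, `failing`) and any URLs passed in are hypothetical; nothing here is part of the Oxide API, and only the Python standard library is used.

```python
import socket
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 5.0) -> tuple[bool, str]:
    """Return (healthy, detail) for one endpoint. Hypothetical helper."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200, f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # Non-2xx responses (e.g. 50x) are raised as HTTPError by urllib.
        return False, f"HTTP {exc.code}"
    except urllib.error.URLError as exc:
        # A gaierror here means the hostname did not resolve, which is the
        # external DNS failure mode discussed above.
        if isinstance(exc.reason, socket.gaierror):
            return False, f"DNS resolution failed: {exc.reason}"
        return False, f"connection failed: {exc.reason}"
    except OSError as exc:
        # Covers timeouts and other socket errors while reading.
        return False, f"I/O error: {exc}"


def check_silos(urls, timeout=5.0, probe_fn=probe):
    """Probe every URL and return {url: (healthy, detail)}."""
    return {url: probe_fn(url, timeout) for url in urls}


def failing(results):
    """List the URLs whose probe did not report healthy."""
    return [url for url, (healthy, _) in results.items() if not healthy]
```

A monitor could call `check_silos` with one URL per silo and alert only on the entries returned by `failing`, which would let it tell a single-silo DNS problem apart from an API-wide outage.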

david-crespo · 2023-08-21T18:06:21Z

I have a tiny draft for this, will post it soon.

Closes #3923

Adds `/v1/ping` that always returns `{ "status": "ok" }` if it returns anything at all. I went with `ping` over the initial `/v1/system/health` because the latter is vague about its meaning, whereas everyone knows ping means a trivial request and response. I also thought it was weird to put an endpoint with no auth check under `/v1/system`, where ~all the other endpoints require fleet-level perms.

This doesn't add too much over hitting an existing endpoint, but I think it's worth it because

* It doesn't hit the DB
* It has no auth check
* It gives a very simple answer to "what endpoint should I use to ping the API?" (a question we have gotten at least once)
* It's easy (I already did it)

Questions that occurred to me while working through this:

- Should we actually attempt to do something in the handler that would tell us, e.g., whether the DB is up?
  - No, that would be more than a ping
  - Raises DoS questions if not auth gated
  - Could add a db status endpoint, or you could use any endpoint that returns data
- What tag should this be under?
  - Initially added a `system` tag because a) this doesn't fit under existing `system/blah` tags and b) it really does feel miscellaneous
  - Changed to `system/status`, with the idea that if we add other kinds of checks, they would be new endpoints under this tag.
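For anyone wiring this into a monitoring system, a client-side check could look roughly like the sketch below. It requests `/v1/ping` with a configurable timeout and treats anything other than a `{ "status": "ok" }` body as a failure with an error message. The `ping` function and the base URL passed to it are assumptions for illustration, not part of any Oxide SDK; only the Python standard library is used.

```python
import json
import urllib.error
import urllib.request


def ping(base_url: str, timeout: float = 5.0) -> tuple[bool, str]:
    """Return (healthy, detail) for the API at base_url. Hypothetical helper."""
    url = base_url.rstrip("/") + "/v1/ping"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = json.loads(resp.read().decode("utf-8"))
    except urllib.error.HTTPError as exc:
        # urllib raises HTTPError for non-2xx responses such as 50x.
        return False, f"HTTP {exc.code} from {url}"
    except (OSError, ValueError) as exc:
        # OSError covers DNS failures, refused connections, and timeouts;
        # ValueError covers a body that is not valid UTF-8 JSON.
        return False, f"{type(exc).__name__}: {exc}"
    if isinstance(body, dict) and body.get("status") == "ok":
        return True, "ok"
    return False, f"unexpected body: {body!r}"
```

Because the endpoint does not touch the DB and has no auth check, a successful result only says that the API server is reachable and responding; it says nothing about database health.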

askfongjojo added this to the 1.0.3 milestone Aug 21, 2023

david-crespo mentioned this issue Aug 21, 2023

[nexus] Add /v1/ping endpoint #3925

Merged

askfongjojo modified the milestones: 1.0.3, 3 Sep 1, 2023

morlandi7 assigned david-crespo Oct 3, 2023

david-crespo closed this as completed in #3925 Oct 9, 2023
