An Oxide health check endpoint for customer monitoring system to use #3923

Closed
askfongjojo opened this issue Aug 21, 2023 · 1 comment · Fixed by #3925

@askfongjojo

We don't currently provide an API endpoint that customers' monitoring systems can use for general health checks. The API should provide:

  • 200 or 50x response
  • error message for failed status check
  • optional configurable timeout

One thing we may want to consider is whether the API can be used with silo-specific endpoints, the Recovery silo endpoint only, or both. From a Nexus monitoring perspective, a check against any silo endpoint should suffice. There is also a chance that a particular silo endpoint doesn't work because of an external DNS issue (which is uncommon), so there may be some value in letting customers check all silo endpoints.
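
For illustration, a customer-side probe along the lines described above might look like the following Python sketch. It treats HTTP 200 as healthy, reports the status or error text otherwise, and takes a configurable timeout. The function names `probe` and `probe_all` and the hostnames in the usage comment are hypothetical, and the path to request depends on which endpoint ends up being provided.

```python
import urllib.error
import urllib.request


def probe(url, timeout=5.0):
    """Return (healthy, message) for a single health-check URL."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            # urlopen raises HTTPError for 4xx/5xx, so only success
            # statuses reach this line.
            return resp.status == 200, f"HTTP {resp.status}"
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 500 or 503.
        return False, f"HTTP {err.code}: {err.reason}"
    except (urllib.error.URLError, OSError) as err:
        # DNS failure, refused connection, or timeout.
        return False, f"unreachable: {err}"


def probe_all(urls, timeout=5.0):
    """Probe every URL and return {url: (healthy, message)}."""
    return {url: probe(url, timeout) for url in urls}


# Hypothetical usage (placeholder hostnames and path, not real endpoints):
# probe_all([
#     "https://silo-a.sys.example.com/health",
#     "https://recovery.sys.example.com/health",
# ], timeout=3.0)
```

Passing one URL per silo would surface a DNS problem that affects a single silo endpoint, while a single URL is enough to tell whether Nexus itself is answering.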

askfongjojo added this to the 1.0.3 milestone on Aug 21, 2023
@david-crespo
Contributor

I have a tiny draft for this, will post it soon.

askfongjojo modified the milestones: 1.0.3, 3 on Sep 1, 2023
david-crespo added a commit that referenced this issue Oct 9, 2023
Closes #3923 

Adds `/v1/ping` that always returns `{ "status": "ok" }` if it returns
anything at all. I went with `ping` over the initial `/v1/system/health`
because the latter is vague about its meaning, whereas everyone knows
ping means a trivial request and response. I also thought it was weird
to put an endpoint with no auth check under `/v1/system`, where ~all the
other endpoints require fleet-level perms.

This doesn't add too much over hitting an existing endpoint, but I think
it's worth it because

* It doesn't hit the DB
* It has no auth check
* It gives a very simple answer to "what endpoint should I use to ping
the API?" (a question we have gotten at least once)
* It's easy (I already did it)
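
As a rough illustration of why such an endpoint is cheap, here is a stand-in written with Python's standard library. It is not the Nexus implementation, and `PingHandler` and `make_server` are made-up names. It only mirrors the behavior described above: no auth check, no database access, and a fixed `{ "status": "ok" }` body.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class PingHandler(BaseHTTPRequestHandler):
    """Stand-in handler: no auth check, no database access."""

    def do_GET(self):
        if self.path != "/v1/ping":
            self.send_error(404)
            return
        # The body is constant, so any response at all means "ok".
        body = json.dumps({"status": "ok"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the sketch quiet


def make_server(port=0):
    """Bind to localhost; port 0 lets the OS pick a free port."""
    return HTTPServer(("127.0.0.1", port), PingHandler)
```

Calling `make_server().serve_forever()` would serve the stub locally. Because the handler touches nothing beyond the request path, a failure to respond points at the server or the network rather than at a backing service.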

Questions that occurred to me while working through this:

- Should we actually attempt to do something in the handler that would
tell us, e.g., whether the DB is up?
  - No, that would be more than a ping
  - Raises DoS questions if not auth gated
  - Could add a db status endpoint or you could use any endpoint that
    returns data
- What tag should this be under?
  - Initially added a `system` tag because a) this doesn't fit under
    existing `system/blah` tags and b) it really does feel miscellaneous
  - Changed to `system/status`, with the idea that if we add other kinds
    of checks, they would be new endpoints under this tag.