
transparency in node health/status information and state changes #112

Open
FliesLikeABrick opened this issue Apr 8, 2021 · 3 comments

FliesLikeABrick commented Apr 8, 2021

While looking into how node liveness is determined (the API's report of alive_ipv4 and alive_ipv6), I found myself wanting to estimate the age of the health data for a system, and to understand whether the current API response reflects the system's actual health or whether a change is likely pending in the next 24 hours (the next ring-admin run). One or more of the following would be helpful:

  • On a system, store more than just the latest status.json. This would help any user on the node tell whether something on the system is intermittently unhealthy, such as IPv4 or IPv6 connectivity. ring-health runs every 60 minutes from cron, but only the latest output is stored at /var/www/ring/status.json. Keeping more history on the node would allow investigation into what data may have changed since the last report to the API (see the sketch after this list).
  • In the API, perhaps add a method and route to access the data from the health table? Speaking of which, can someone provide the schema for the health table, or add it to the SCHEMA in ring-admin?
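
To make the first item concrete, here is a minimal sketch of what node-side history could look like, assuming ring-health continues to write /var/www/ring/status.json hourly. The history directory, retention window, and the script itself are hypothetical, not existing tooling:

```python
#!/usr/bin/env python3
"""Keep a rolling history of ring-health output on a node.

Hypothetical sketch: run from cron right after ring-health, copy the
freshly written status.json into a timestamped archive, then prune
snapshots older than the retention window.
"""
import os
import shutil
import time

STATUS_FILE = "/var/www/ring/status.json"     # written hourly by ring-health
HISTORY_DIR = "/var/www/ring/status-history"  # assumed location, not current tooling
RETENTION_SECONDS = 7 * 24 * 3600             # keep one week of hourly snapshots

def archive_status():
    os.makedirs(HISTORY_DIR, exist_ok=True)
    stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime())
    shutil.copy2(STATUS_FILE, os.path.join(HISTORY_DIR, f"status-{stamp}.json"))

def prune_history():
    cutoff = time.time() - RETENTION_SECONDS
    for name in os.listdir(HISTORY_DIR):
        path = os.path.join(HISTORY_DIR, name)
        if os.path.getmtime(path) < cutoff:
            os.remove(path)

if __name__ == "__main__":
    archive_status()
    prune_history()
```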

With a bit of guidance, I can begin work on a PR for one or both of the items listed above, which could in turn enable some further contributions.

Other questions:

  • The logic in ring-admin that updates alive_v4 and alive_v6 in the machines table lives in ansible_process(). What cron job or other trigger results in ansible_process() being called? The closest cron job I can find is the one for purging machines, but that appears to be a cleanup task rather than something that calls ansible_process().
  • ring-admin will skip marking nodes as dead_v4/dead_v6 if more than 10 are detected down in a single run (illustrated in the sketch after this list). Is there any visible report when this occurs, or does it go to /dev/null? The concern is that if 10 or more nodes legitimately fail in a single run (however often ansible_process() is run), ring-admin will never catch up on subsequent state changes unless enough machines recover.
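
For reference, a hypothetical reconstruction of the safeguard described above (names and structure are illustrative, not the actual ring-admin code); it shows how a silently skipped batch stays invisible to operators:

```python
# Illustrative only: not the real ring-admin implementation.
MAX_DEAD_PER_RUN = 10

def apply_liveness_changes(down_nodes, mark_dead):
    """Mark nodes dead, unless suspiciously many are down at once."""
    if len(down_nodes) > MAX_DEAD_PER_RUN:
        # The whole batch is skipped; without a log line or alert here,
        # operators cannot tell this condition ever occurred.
        return
    for node in down_nodes:
        mark_dead(node)

if __name__ == "__main__":
    # 12 nodes down at once: nothing is marked dead, and nothing is reported.
    apply_liveness_changes([f"node{i}" for i in range(12)], mark_dead=print)
```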

rodecker (Member) commented Apr 9, 2021

Thanks for taking the time to look into this!

I like your idea of storing health history on the node itself. Exposing more health information in the API is something we thought of as well (#66) but have not yet gotten around to implementing.

FliesLikeABrick (Author) commented

@rodecker I'm looking at adding an API method and route for accessing the latest health report from a node. A few questions on direction:

  • What do you prefer for the URI: /1.0/nodes/[node id]/health_report -- or -- /1.0/health_report/[node id]/ (and maybe variants of each for hostname)? The former feels more intuitive, but the latter is a bit more RESTful given the existing POST /1.0/health_report endpoint. Or both could easily be added by stacking an @app.route decorator for each on the same method (see the sketch after this list).
  • What is the best way to develop API code locally? Is there a database snapshot I can load into a local MySQL instance, or something similar? I imagine standalone Flask will work well enough for testing; is anything needed besides the database?
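
A minimal sketch of the stacked-decorator idea from the first question, assuming a Flask app object named app; fetch_latest_health_report() is a hypothetical stand-in for the real query against the health table:

```python
from flask import Flask, abort, jsonify

app = Flask(__name__)

def fetch_latest_health_report(node_id):
    # Hypothetical stub standing in for a lookup in the health table.
    return {"node_id": node_id, "alive_ipv4": True, "alive_ipv6": True}

# Two routes, one view function: both URI styles serve the same data.
@app.route("/1.0/nodes/<int:node_id>/health_report")
@app.route("/1.0/health_report/<int:node_id>")
def get_health_report(node_id):
    report = fetch_latest_health_report(node_id)
    if report is None:
        abort(404)
    return jsonify(report)
```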

rodecker (Member) commented

@FliesLikeABrick I agree /1.0/nodes/[node id]/health_report (or /health) would be more intuitive. But no harm in adding both.
I've sent you the details to set up a dev environment on IRC.
