
transparency in node health/status information and state changes #112

Open
FliesLikeABrick opened this issue Apr 8, 2021 · 3 comments

FliesLikeABrick commented Apr 8, 2021

While looking into how node liveness is determined (the API's report of alive_ipv4 and alive_ipv6), I found myself wanting to estimate the age of the health data for a system, and to understand whether the current API response reflects the system's actual health or whether a change is likely pending in the next 24 hours (the next ring-admin run). One or more of the following would be helpful:

  • On a system, store more than just the latest status.json. This would help any user on the node tell whether something on the system is intermittently unhealthy, such as IPv4 or IPv6 connectivity. ring-health runs every 60 minutes from cron, but only the latest output is stored at /var/www/ring/status.json. Keeping more history on the node would allow investigation into what data may have changed since the last report to the API (see the sketch after this list).
  • In the API, perhaps add a method and route to access the data from the health table? Speaking of which, can someone provide the schema for the health table, or add it to the SCHEMA in ring-admin?
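
To make the first item concrete, here is a minimal sketch of what node-side history could look like, assuming ring-health continues to write /var/www/ring/status.json hourly. The history directory, retention window, and the script itself are hypothetical, not existing tooling:

```python
#!/usr/bin/env python3
"""Keep a rolling history of ring-health output on a node.

Hypothetical sketch: run from cron right after ring-health, copy the
freshly written status.json into a timestamped archive, then prune
snapshots older than the retention window.
"""
import os
import shutil
import time

STATUS_FILE = "/var/www/ring/status.json"     # written hourly by ring-health
HISTORY_DIR = "/var/www/ring/status-history"  # assumed location, not current tooling
RETENTION_SECONDS = 7 * 24 * 3600             # keep one week of hourly snapshots

def archive_status():
    os.makedirs(HISTORY_DIR, exist_ok=True)
    stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime())
    shutil.copy2(STATUS_FILE, os.path.join(HISTORY_DIR, f"status-{stamp}.json"))

def prune_history():
    cutoff = time.time() - RETENTION_SECONDS
    for name in os.listdir(HISTORY_DIR):
        path = os.path.join(HISTORY_DIR, name)
        if os.path.getmtime(path) < cutoff:
            os.remove(path)

if __name__ == "__main__":
    archive_status()
    prune_history()
```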

With a bit of guidance, I can begin work on a PR for one or both of the items listed above, which could in turn enable some further contributions.

Other questions:

  • The logic in ring-admin that updates alive_v4 and alive_v6 in the machines table lives in ansible_process(). What cron job or other trigger results in ansible_process() being called? The closest cron job I can find is the one for purging machines, but that appears to be a cleanup task rather than something that calls ansible_process().
  • ring-admin will skip marking nodes as dead_v4/dead_v6 if more than 10 are detected down in a single run (illustrated in the sketch after this list). Is there any visible report when this occurs, or does it go to /dev/null? The concern is that if 10 or more nodes legitimately fail in a single run (however often ansible_process() is run), ring-admin will never catch up on subsequent state changes unless enough machines recover.
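
For reference, a hypothetical reconstruction of the safeguard described above (names and structure are illustrative, not the actual ring-admin code); it shows how a silently skipped batch stays invisible to operators:

```python
# Illustrative only: not the real ring-admin implementation.
MAX_DEAD_PER_RUN = 10

def apply_liveness_changes(down_nodes, mark_dead):
    """Mark nodes dead, unless suspiciously many are down at once."""
    if len(down_nodes) > MAX_DEAD_PER_RUN:
        # The whole batch is skipped; without a log line or alert here,
        # operators cannot tell this condition ever occurred.
        return
    for node in down_nodes:
        mark_dead(node)

if __name__ == "__main__":
    # 12 nodes down at once: nothing is marked dead, and nothing is reported.
    apply_liveness_changes([f"node{i}" for i in range(12)], mark_dead=print)
```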

rodecker (Member) commented Apr 9, 2021

Thanks for taking the time to look into this!

I like your idea of storing health history on the node itself. Exposing more health information in the API is something we thought of as well (#66) but have not yet gotten around to implementing.

FliesLikeABrick (Author) commented

@rodecker I'm looking at adding an API method and route for accessing the latest health report from a node. A few questions on direction:

  • What do you prefer for the URI: /1.0/nodes/[node id]/health_report -- or -- /1.0/health_report/[node id]/ (and maybe variants of each for hostname)? The former feels more intuitive, but the latter is a bit more RESTful given the existing POST /1.0/health_report endpoint. Or both could easily be added by stacking an @app.route decorator for each on the same method (see the sketch after this list).
  • What is the best way to develop API code locally? Is there a database snapshot I can load into a local MySQL instance, or something similar? I imagine standalone Flask will work well enough for testing; is anything needed besides the database?
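
A minimal sketch of the stacked-decorator idea from the first question, assuming a Flask app object named app; fetch_latest_health_report() is a hypothetical stand-in for the real query against the health table:

```python
from flask import Flask, abort, jsonify

app = Flask(__name__)

def fetch_latest_health_report(node_id):
    # Hypothetical stub standing in for a lookup in the health table.
    return {"node_id": node_id, "alive_ipv4": True, "alive_ipv6": True}

# Two routes, one view function: both URI styles serve the same data.
@app.route("/1.0/nodes/<int:node_id>/health_report")
@app.route("/1.0/health_report/<int:node_id>")
def get_health_report(node_id):
    report = fetch_latest_health_report(node_id)
    if report is None:
        abort(404)
    return jsonify(report)
```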

rodecker (Member) commented

@FliesLikeABrick I agree /1.0/nodes/[node id]/health_report (or /health) would be more intuitive. But no harm in adding both.
I've sent you the details to set up a dev environment on IRC.
