-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[system-health] Add support for monitoring system health #4835
[system-health] Add support for monitoring system health #4835
Conversation
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
…le; 2. set system led in library instead of daemon
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. Please wait for other reviewers.
@sujinmkang, does this look ok for you? if yes, would please approve? |
fe2f1be
retest this please |
retest vs please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and it comes with tests :-)
@Junchao-Mellanox could you please resolve conflicts and once all tests pass we can merge |
466f983
@liat-grozovik resolved. |
) * system health first commit * system health daemon first commit * Finish healthd * Changes due to lower layer logic change * Get ASIC temperature from TEMPERATURE_INFO table * Add system health make rule and service files * fix bugs found during manual test * Change make file to install system-health library to host * Set system LED to blink on bootup time * Caught exceptions in system health checker to make it more robust * fix issue that fan/psu presence will always be true * fix issue for external checker * move system-health service to right after rc-local service * Set system-health service start after database service * Get system up time via /proc/uptime * Provide more information in stat for CLI to use * fix typo * Set default category to External for external checker * If external checker reported OK, save it to stat too * Trim string for external checker output * fix issue: PSU voltage check always return OK * Add unit test cases for system health library * Fix LGTM warnings * fix demo comments: 1. get boot up timeout from monit configuration file; 2. set system led in library instead of daemon * Remove boot_timeout configuration because it will get from monit config file * Fix argument miss * fix unit test failure * fix issue: summary status is not correct * Fix format issues found in code review * rename th to threshold to make it clearer * Fix review comment: 1. add a .dep file for system health; 2. deprecated daemon_base and uses sonic-py-common instead * Fix unit test failure * Fix LGTM alert * Fix LGTM alert * Fix review comments * Fix review comment * 1. Add relevant comments for system health; 2. rename external_checker to user_define_checker * Ignore check for unknown service type * Fix unit test issue * Rename user define checker to user defined checker * Rename user_define_checkers to user_defined_checkers for configuration file * Renmae file user_define_checker.py -> user_defined_checker.py * Fix typo * Adjust import order for config.py Co-authored-by: Joe LeVeque <[email protected]> * Adjust import order for src/system-health/health_checker/hardware_checker.py Co-authored-by: Joe LeVeque <[email protected]> * Adjust import order for src/system-health/scripts/healthd Co-authored-by: Joe LeVeque <[email protected]> * Adjust import orders in src/system-health/tests/test_system_health.py * Fix typo * Add new line after import * If system health configuration file not exist, healthd should exit * Fix indent and enable pytest coverage * Fix typo * Fix typo * Remove global logger and use log functions inherited from super class * Change info level logger to notice level Co-authored-by: Joe LeVeque <[email protected]>
- Why I did it
System health feature requires a service to collect system status and manage the system status LED.
- How I did it
- How to verify it
Manual test on SN2700. Regression test on SN2700, SN4600C, SN4700, SN3800, SN2100.
- Description for the changelog
- A picture of a cute animal (not mandatory but encouraged)