-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[system-health] Make check interval more accurate #14085
[system-health] Make check interval more accurate #14085
Conversation
@prgeor could you please help to review or assign someone to do so? |
@Junchao-Mellanox can you elaborate more on this |
Currently, sonic-mgmt has a test case like this:
The case failed intermittently because sometimes 70 seconds is not enough. If we introduce a time compensation to healthd main loop, healthd loop interval will be more accurate and 70 seoncds is enough. BTW, time compensation is a common way in sonic. |
|
||
if self.stop_event.wait(manager.config.interval): | ||
break | ||
while self._run_checker(manager, chassis): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
previously healthd was running continuously in while 1 loop. with this change the loop can break and healthd wont run. why this is changed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previous the loop still can be break, this logic does not change actually.
while 1:
stat = manager.check(chassis)
self._process_stat(chassis, manager.config, stat)
if self.stop_event.wait(manager.config.interval):
break
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.log_notice(f'System health takes {elapse} seconds for one iteration') | ||
sleep_time_in_sec = 1 | ||
if self.stop_event.wait(sleep_time_in_sec): | ||
return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if return 'False' healthd will stop. can we just break?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if stop_event is set, it means that healthd got a signal, and it should stop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@liat-grozovik @lguohan please help merge |
- Why I did it Healthd check system status every 60 seconds. However, running checker may take several seconds. Say checker takes X seconds, healthd takes (60 + X) seconds to finish one iteration. This implementation makes sonic-mgmt test case not so stable because the value X is hard to predict and different among different platforms. This PR introduces an interval compensation mechanism to healthd main loop. - How I did it Introduces an interval compensation mechanism to healthd main loop: healthd should wait (60 - X) seconds for next iteration - How to verify it Manual test Unit test
Cherry-pick PR to 202205: #14256 |
- Why I did it Healthd check system status every 60 seconds. However, running checker may take several seconds. Say checker takes X seconds, healthd takes (60 + X) seconds to finish one iteration. This implementation makes sonic-mgmt test case not so stable because the value X is hard to predict and different among different platforms. This PR introduces an interval compensation mechanism to healthd main loop. - How I did it Introduces an interval compensation mechanism to healthd main loop: healthd should wait (60 - X) seconds for next iteration - How to verify it Manual test Unit test
- Why I did it Healthd check system status every 60 seconds. However, running checker may take several seconds. Say checker takes X seconds, healthd takes (60 + X) seconds to finish one iteration. This implementation makes sonic-mgmt test case not so stable because the value X is hard to predict and different among different platforms. This PR introduces an interval compensation mechanism to healthd main loop. - How I did it Introduces an interval compensation mechanism to healthd main loop: healthd should wait (60 - X) seconds for next iteration - How to verify it Manual test Unit test
Cherry-pick PR to 202211: #14326 |
- Why I did it Healthd check system status every 60 seconds. However, running checker may take several seconds. Say checker takes X seconds, healthd takes (60 + X) seconds to finish one iteration. This implementation makes sonic-mgmt test case not so stable because the value X is hard to predict and different among different platforms. This PR introduces an interval compensation mechanism to healthd main loop. - How I did it Introduces an interval compensation mechanism to healthd main loop: healthd should wait (60 - X) seconds for next iteration - How to verify it Manual test Unit test
Why I did it
Healthd check system status every 60 seconds. However, running checker may take several seconds. Say checker takes X seconds, healthd takes (60 + X) seconds to finish one iteration. This implementation makes sonic-mgmt test case not so stable because the value X is hard to predict and different among different platforms. This PR introduces an interval
compensation mechanism to healthd main loop.
How I did it
Introduces an interval compensation mechanism to healthd main loop: healthd should wait (60 - X) seconds for next iteration
How to verify it
Manual test
Unit test
Which release branch to backport (provide reason below if selected)
Description for the changelog
Ensure to add label/tag for the feature raised. example - PR#2174 under sonic-utilities repo. where, Generic Config and Update feature has been labelled as GCU.
Link to config_db schema for YANG module changes
A picture of a cute animal (not mandatory but encouraged)