Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[system-health] Make check interval more accurate #14085

Merged

Conversation

Junchao-Mellanox
Copy link
Collaborator

Why I did it

Healthd check system status every 60 seconds. However, running checker may take several seconds. Say checker takes X seconds, healthd takes (60 + X) seconds to finish one iteration. This implementation makes sonic-mgmt test case not so stable because the value X is hard to predict and different among different platforms. This PR introduces an interval
compensation mechanism to healthd main loop.

How I did it

Introduces an interval compensation mechanism to healthd main loop: healthd should wait (60 - X) seconds for next iteration

How to verify it

Manual test
Unit test

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205
  • 202211

Description for the changelog

Ensure to add label/tag for the feature raised. example - PR#2174 under sonic-utilities repo. where, Generic Config and Update feature has been labelled as GCU.

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

@liat-grozovik
Copy link
Collaborator

@prgeor could you please help to review or assign someone to do so?

@prgeor
Copy link
Contributor

prgeor commented Mar 7, 2023

@Junchao-Mellanox can you elaborate more on this This implementation makes sonic-mgmt test case not so stable

@Junchao-Mellanox
Copy link
Collaborator Author

@Junchao-Mellanox can you elaborate more on this This implementation makes sonic-mgmt test case not so stable

Currently, sonic-mgmt has a test case like this:

  1. mock something, e.g. PSU high temperature
  2. wait 70 seconds
  3. check redis database and verify healthd put PSU high temperature error to STATE DB

The case failed intermittently because sometimes 70 seconds is not enough. If we introduce a time compensation to healthd main loop, healthd loop interval will be more accurate and 70 seoncds is enough. BTW, time compensation is a common way in sonic.


if self.stop_event.wait(manager.config.interval):
break
while self._run_checker(manager, chassis):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

previously healthd was running continuously in while 1 loop. with this change the loop can break and healthd wont run. why this is changed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previous the loop still can be break, this logic does not change actually.

            while 1:
                stat = manager.check(chassis)
                self._process_stat(chassis, manager.config, stat)

                if self.stop_event.wait(manager.config.interval):
                    break

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.log_notice(f'System health takes {elapse} seconds for one iteration')
sleep_time_in_sec = 1
if self.stop_event.wait(sleep_time_in_sec):
return False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if return 'False' healthd will stop. can we just break?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if stop_event is set, it means that healthd got a signal, and it should stop.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@prgeor
Copy link
Contributor

prgeor commented Mar 15, 2023

@liat-grozovik @lguohan please help merge

@liat-grozovik liat-grozovik merged commit 03cab99 into sonic-net:master Mar 15, 2023
@Junchao-Mellanox Junchao-Mellanox deleted the fix-healthd-interval branch March 15, 2023 06:04
mssonicbld pushed a commit to mssonicbld/sonic-buildimage that referenced this pull request Mar 15, 2023
- Why I did it

Healthd check system status every 60 seconds. However, running checker may take several seconds. Say checker takes X seconds, healthd takes (60 + X) seconds to finish one iteration. This implementation makes sonic-mgmt test case not so stable because the value X is hard to predict and different among different platforms. This PR introduces an interval
compensation mechanism to healthd main loop.

- How I did it

Introduces an interval compensation mechanism to healthd main loop: healthd should wait (60 - X) seconds for next iteration

- How to verify it

Manual test
Unit test
@mssonicbld
Copy link
Collaborator

Cherry-pick PR to 202205: #14256

mssonicbld pushed a commit that referenced this pull request Mar 15, 2023
- Why I did it

Healthd check system status every 60 seconds. However, running checker may take several seconds. Say checker takes X seconds, healthd takes (60 + X) seconds to finish one iteration. This implementation makes sonic-mgmt test case not so stable because the value X is hard to predict and different among different platforms. This PR introduces an interval
compensation mechanism to healthd main loop.

- How I did it

Introduces an interval compensation mechanism to healthd main loop: healthd should wait (60 - X) seconds for next iteration

- How to verify it

Manual test
Unit test
mssonicbld pushed a commit to mssonicbld/sonic-buildimage that referenced this pull request Mar 19, 2023
- Why I did it

Healthd check system status every 60 seconds. However, running checker may take several seconds. Say checker takes X seconds, healthd takes (60 + X) seconds to finish one iteration. This implementation makes sonic-mgmt test case not so stable because the value X is hard to predict and different among different platforms. This PR introduces an interval
compensation mechanism to healthd main loop.

- How I did it

Introduces an interval compensation mechanism to healthd main loop: healthd should wait (60 - X) seconds for next iteration

- How to verify it

Manual test
Unit test
@mssonicbld
Copy link
Collaborator

Cherry-pick PR to 202211: #14326

mssonicbld pushed a commit that referenced this pull request Mar 19, 2023
- Why I did it

Healthd check system status every 60 seconds. However, running checker may take several seconds. Say checker takes X seconds, healthd takes (60 + X) seconds to finish one iteration. This implementation makes sonic-mgmt test case not so stable because the value X is hard to predict and different among different platforms. This PR introduces an interval
compensation mechanism to healthd main loop.

- How I did it

Introduces an interval compensation mechanism to healthd main loop: healthd should wait (60 - X) seconds for next iteration

- How to verify it

Manual test
Unit test
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants