Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[thermalctld] Enlarge startretries value to avoid thermalctld not able to restart during regression test #5633

Merged
merged 2 commits into from
Oct 30, 2020

Conversation

Junchao-Mellanox
Copy link
Collaborator

- Why I did it

Found error logs in syslog:

Oct 10 12:41:09.148655 arc-switch1029 INFO pmon#supervisord 2020-10-10 12:41:07,689 INFO gave up: thermalctld entered FATAL state, too many start retries too quickly

The issue is related to the "startsecs" configuration of thermalctld in /etc/supervisor/conf.d/supervisord.conf. The current configuration setting the "startsecs" to 10, which means that it require thermalctld process running at least 10 seconds or supervisord will not restart it after it exiting even if the exit code is expected.

See the official document for "startsecs" at http://supervisord.org/configuration.html:

startsecs

The total number of seconds which the program needs to stay running after a startup to consider the start successful. If the program does not stay up for this many seconds after it has started, even if it exits with an “expected” exit code (see exitcodes), the startup will be considered a failure. Set to 0 to indicate that the program needn’t stay running for any particular amount of time.

- How I did it

The fix is to change the "startsecs" configuration from 10 to 0

- How to verify it

Manual test

- Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006

- Description for the changelog

- A picture of a cute animal (not mandatory but encouraged)

@jleveque
Copy link
Contributor

@Junchao-Mellanox: Usually we set startsecs=0 for one-shot processes, like scripts, which are expected to exit in under a second. However, thermalctld is a daemon and is meant to run indefinitely. Is this the proper fix? Under what conditions is thermalctl expected to exit quickly?

@Junchao-Mellanox
Copy link
Collaborator Author

@Junchao-Mellanox: Usually we set startsecs=0 for one-shot processes, like scripts, which are expected to exit in under a second. However, thermalctld is a daemon and is meant to run indefinitely. Is this the proper fix? Under what conditions is thermalctl expected to exit quickly?

Hi Joe, there is a test case which loads invalid thermal control configuration and verify that thermal control daemon won't crash. In that case, it restart the daemon after the checking, and it could start & kill the daemon in a very short time.

@jleveque
Copy link
Contributor

I see. Thanks for the info. I'm not sure we should change this simply to satisfy a specific test case. It sounds like it might be more proper to modify the test case?

@Junchao-Mellanox
Copy link
Collaborator Author

I see. Thanks for the info. I'm not sure we should change this simply to satisfy a specific test case. It sounds like it might be more proper to modify the test case?

We could add a delay in the test case, like "time.sleep(10)", not sure if it is a good solution. IMO, thermalctld is designed to help protect the system from being overheat, maybe we could set it to restart always. Any suggestion?

@jleveque
Copy link
Contributor

IMO, thermalctld is designed to help protect the system from being overheat, maybe we could set it to restart always. Any suggestion?

I think this is a valid concern. thermalctld is a critical process. I think it makes sense to be persistent at trying to get it running. I'm not opposed to setting it to restart always. @sujinmkang: Do you see any issues with configuring thermalctld to restart always?

@Junchao-Mellanox
Copy link
Collaborator Author

retest broadcom please

@liat-grozovik
Copy link
Collaborator

retest mellanox please

1 similar comment
@Junchao-Mellanox
Copy link
Collaborator Author

retest mellanox please

@sujinmkang
Copy link
Collaborator

@Junchao-Mellanox and @jleveque As we discussed in the email thread, if the thermal control configuration is critical for thermalctld to run, then thermalctld crashes and restart several times and then stops restarting, is right behavior instead of continuous restart, I think. If thermalctld can run some minimum checks without the configuration, it should give warning/error messages periodically but it should run without crash.
@Junchao-Mellanox Can you share the exact test steps and the expected outputs/states?

@Junchao-Mellanox
Copy link
Collaborator Author

@Junchao-Mellanox and @jleveque As we discussed in the email thread, if the thermal control configuration is critical for thermalctld to run, then thermalctld crashes and restart several times and then stops restarting, is right behavior instead of continuous restart, I think. If thermalctld can run some minimum checks without the configuration, it should give warning/error messages periodically but it should run without crash.
@Junchao-Mellanox Can you share the exact test steps and the expected outputs/states?

Hi @sujinmkang, the test case is like this:

  • Copy an invalid configuration file to switch
  • Kill thermalctld to make it load the invalid one
  • Make sure that thermalctld doesn't crash after reading an invalid configuration
  • Restore the configuration to a normal one
  • Kill thermalctld to make it load the normal one

So thermalctld itself doesn't crash, it is the test case which kill it manually.

@jleveque
Copy link
Contributor

Retest baseimage please

@jleveque
Copy link
Contributor

Retest mellanox please

@jleveque
Copy link
Contributor

@Junchao-Mellanox: Can you please update the PR title and description to match the new change?

@Junchao-Mellanox Junchao-Mellanox changed the title Fix issue: restart thermalctld too quick cause supervisord never restart it again Fix issue: enlarge startretries value to avoid thermalctld not able to restart during regression test Oct 26, 2020
@Junchao-Mellanox
Copy link
Collaborator Author

retest mellanox please

@jleveque jleveque requested a review from sujinmkang October 26, 2020 21:55
@keboliu
Copy link
Collaborator

keboliu commented Oct 29, 2020

@sujinmkang would you please help to review.

@jleveque jleveque changed the title Fix issue: enlarge startretries value to avoid thermalctld not able to restart during regression test [thermalctld] Enlarge startretries value to avoid thermalctld not able to restart during regression test Oct 30, 2020
@jleveque jleveque merged commit 781188f into sonic-net:master Oct 30, 2020
@keboliu
Copy link
Collaborator

keboliu commented Oct 31, 2020

@abdosi would you please help to cherry-pick?

abdosi pushed a commit that referenced this pull request Nov 3, 2020
…e to restart during regression test (#5633)

Increase startretires value from default of 10 to 50 to prevent supervisor from placing thermalctld in FATAL state during regression testing. Also ensures supervisord tries hard to get thermalctld running in production, as thermalctld is critical to prevent device from overheating.
@Junchao-Mellanox Junchao-Mellanox deleted the fix_thermalctld_restart branch December 15, 2020 01:42
santhosh-kt pushed a commit to santhosh-kt/sonic-buildimage that referenced this pull request Feb 25, 2021
…e to restart during regression test (sonic-net#5633)

Increase startretires value from default of 10 to 50 to prevent supervisor from placing thermalctld in FATAL state during regression testing. Also ensures supervisord tries hard to get thermalctld running in production, as thermalctld is critical to prevent device from overheating.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants