[pytest] Test the feature of container checker. #2890
Conversation
Signed-off-by: Yong Zhao <[email protected]>
/Azurepipelines run
Pull request contains merge conflicts.
Signed-off-by: Yong Zhao <[email protected]>
Retest vsimage please.
The test takes 15 minutes; is there a way to reduce the test time?
/Azurepipelines run
Azure Pipelines successfully started running 1 pipeline(s).
# Wait for 6 minutes such that Monit has a chance to write alerting message into syslog.
logger.info("Sleep 6 minutes to wait for the alerting message...")
time.sleep(360)
This seems very long. Why do we need to wait for 6 minutes?
The script container_checker is run periodically by Monit, and if one of the containers is not running, Monit will write an alerting message into syslog after 5 minutes. So this pytest script has to wait for 6 minutes and then check whether the alerting message appeared in syslog.
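The arithmetic behind the 360-second sleep can be sketched as follows. This is a minimal sketch: the 60-second Monit poll interval and the helper name are assumptions for illustration, not taken from the PR.

```python
# Sketch of where time.sleep(360) comes from. The 60-second Monit poll
# interval is an assumption based on a typical Monit setup.
MONIT_CYCLE_SECONDS = 60   # assumed Monit poll interval
FAILING_CYCLES = 5         # "for 5 times within 5 cycles" in the Monit config
BUFFER_SECONDS = 60        # slack so the alert has time to land in syslog

def alert_wait_seconds(cycle_seconds=MONIT_CYCLE_SECONDS,
                       failing_cycles=FAILING_CYCLES,
                       buffer_seconds=BUFFER_SECONDS):
    """Seconds the test should sleep before checking syslog for the alert."""
    return cycle_seconds * failing_cycles + buffer_seconds

# 5 cycles of 60 s plus a 60 s buffer gives the 360 s used in the test.
```

Reducing the failing-cycle count (as discussed below in this thread) shrinks the wait proportionally.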
Can you reduce the Monit duration?
Currently the Monit configuration of the script container_checker in the image is as follows:
check program container_checker with path "/usr/bin/container_checker"
    if status != 0 for 5 times within 5 cycles then alert repeat every 1 cycles
Can we change the configuration like this? Then I can reduce the waiting time here.
check program container_checker with path "/usr/bin/container_checker"
    if status != 0 for 2 times within 2 cycles then alert repeat every 1 cycles
Yes, change it to 2 cycles in the test and change it back after the test finishes.
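The swap-and-restore idea above can be sketched by rendering the two stanza variants side by side. The helper name is hypothetical; the real test edits /etc/monit/conf.d/sonic-host in place and restores the original file afterwards.

```python
def render_container_checker_stanza(cycles):
    """Render the Monit stanza for container_checker with a given cycle count.

    Hypothetical helper for illustration only; the two renderings below
    correspond to the "reduced for the test" and "restored after the
    test" configurations discussed in this thread.
    """
    return (
        'check program container_checker with path '
        '"/usr/bin/container_checker"\n'
        f'    if status != 0 for {cycles} times within {cycles} cycles '
        'then alert repeat every 1 cycles'
    )

reduced = render_container_checker_stanza(2)   # used while the test runs
original = render_container_checker_stanza(5)  # restored when it finishes
```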
Signed-off-by: Yong Zhao <[email protected]>
check_alerting_message(duthost, stopped_container_list)

logger.info("Executing the config reload...")
config_reload(duthost)
@bingwang-ms @wangxin Please help me review this PR again. The previous PR (#2852) was reverted because BGP sessions were down after all containers were restarted, and it then failed the post-check on the KVM testbed. The root cause is that I should not restart the containers in the order in which they were stopped. The correct way is to use config_reload(...) to restart the containers, since some containers depend on others. After running config_reload(...), line 255 checks whether all the stopped containers are running again and restarts any containers that config_reload(...) did not restart, such as the mgmt-framework container.
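The recovery logic described here can be sketched with the restart operation factored out as an injected callable. This is a sketch under assumptions: the function name and the `restart` parameter are hypothetical, standing in for config_reload(...) plus check_containers_status(...) in the real script.

```python
def restart_leftover_containers(running_after_reload, stopped_containers, restart):
    """After config_reload, restart any stopped container it did not bring back.

    `restart` is an injected callable (hypothetical), standing in for
    starting the container on the DUT.
    """
    leftovers = [name for name in stopped_containers
                 if name not in running_after_reload]
    for name in leftovers:
        restart(name)
    return leftovers

# Example: config_reload brought back everything except mgmt-framework.
restarted = []
leftover = restart_leftover_containers(
    running_after_reload={"swss", "syncd", "bgp"},
    stopped_containers=["bgp", "mgmt-framework"],
    restart=restarted.append,
)
```

Injecting the restart action keeps the "which containers still need help" decision testable without a DUT.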
LGTM, except for the time.sleep(360) that guohan has mentioned. Please update the Monit config for container_checker and shorten the sleep time. Thanks.
Thanks, Bing! Please help me review the function update_monit_service(...) and line 274.
Why do we have to do config_reload here? Is config_reload_after_tests not enough?
This script will first do the config_reload(...) to start the stopped containers and then do the post-check to verify that BGP sessions are established and all critical processes are running.
Why not move config_reload and the checking below to config_reload_after_tests?
@yxieca This pytest script will skip DuTs installed with the 201911 image. If this pytest script were run against a DuT installed with a public image, and if I removed the config_reload(...) here and only invoked config_reload(...) in the fixture config_reload_after_test(...), then the mgmt-framework container would not be restarted. That is why this script first invokes config_reload(...) and then uses the function check_containers_status(...) to restart the containers that config_reload(...) could not restart.
Signed-off-by: Yong Zhao <[email protected]>
None.
"""
logger.info("Reducing the monitoring interval of container_checker.")
duthost.shell("sudo sed -i '$s/5/2/g' /etc/monit/conf.d/sonic-host")
This code here is fragile: if the SONiC image changes the interval to 6 minutes, this substitution will no longer match and it will break here.
Good catch!
…n 1. Signed-off-by: Yong Zhao <[email protected]>
None.
"""
logger.info("Reducing the monitoring interval of container_checker.")
duthost.shell("sudo sed -i '$s/[2-9]\|[1-9][0-9]\+/2/g' /etc/monit/conf.d/sonic-host")
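The sed expression above is a POSIX BRE applied only to the last line of the file (the `$` address). Its Python-regex equivalent shows what it matches: any single digit 2-9, or any number of 10 or more, so the rewrite keeps working even if the image ships a count other than 5. The sample line is the stanza quoted earlier in this thread.

```python
import re

# Python equivalent of the sed BRE [2-9]\|[1-9][0-9]\+ used above.
pattern = r"[2-9]|[1-9][0-9]+"

line = ("    if status != 0 for 5 times within 5 cycles "
        "then alert repeat every 1 cycles")
reduced = re.sub(pattern, "2", line)
# Both 5s become 2s; the 0 in "!= 0" and the 1 in "every 1 cycles" are
# deliberately left alone by the character classes.
```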
@lguohan Please help me review again!
Signed-off-by: Yong Zhao <[email protected]>
return stopped_container_list


def check_alerting_message(duthost, stopped_container_list):
Yes, use the log analyzer.
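Stripped of the marker bookkeeping that sonic-mgmt's log-analyzer plugin handles, the check boils down to scanning the collected syslog lines for each stopped container's name. This is a sketch: the alert text in the sample line is an assumption about what container_checker emits, not a quote from the source.

```python
def unreported_containers(syslog_lines, stopped_containers):
    """Return stopped containers that never appear in a container_checker alert.

    Filtering on the "container_checker" substring is an assumption about
    the alert format; adjust it to match the real Monit message.
    """
    alerts = [ln for ln in syslog_lines if "container_checker" in ln]
    return [name for name in stopped_containers
            if not any(name in alert for alert in alerts)]

# Hypothetical syslog sample for illustration.
sample_syslog = [
    "Apr  1 00:00:00 sonic monit: 'container_checker' status failed -- "
    "Expected containers not running: telemetry snmp",
]
```

The test then asserts that `unreported_containers(...)` comes back empty, i.e. every stopped container was mentioned in an alert.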
Updated!
Signed-off-by: Yong Zhao <[email protected]>
with 201911 or old image versions. Signed-off-by: Yong Zhao <[email protected]>
roll them back after testing. Signed-off-by: Yong Zhao <[email protected]>
Signed-off-by: Yong Zhao <[email protected]>
files. Use `mv` instead of `cp -f` + `rm -f` to restore Monit configuration files. Signed-off-by: Yong Zhao <[email protected]>
overwrite. Signed-off-by: Yong Zhao <[email protected]>
Signed-off-by: Yong Zhao <[email protected]>
the post checking. Signed-off-by: Yong Zhao <[email protected]>
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
Signed-off-by: Yong Zhao <[email protected]>
…ck. (#3171) Signed-off-by: Yong Zhao <[email protected]>
Summary: This PR aims to increase the maximum value of the Monit stable time in the sanity check. Fixes # (issue)
Type of change: [x] Bug fix; [ ] Testbed and Framework (new/improvement); [ ] Test case (new/improvement)
What is the motivation for this PR? When PR #2890 was run against the virtual testbed, it restarted the Monit service in a fixture after testing. This caused the sanity check of the next test to fail, since Monit did not have enough time to initialize the states of the services.
How did you do it? I increased the maximum value of the Monit stable time in the sanity check.
How did you verify/test it? I verified this on the virtual testbed by running the script kvmtest.sh to make sure the pytest script can pass the test.
Any platform specific information? N/A
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
Signed-off-by: Yong Zhao [email protected]
Description of PR
Summary:
This PR aims to test the feature of container checker and PR link is sonic-net/sonic-buildimage#6251.
Fixes # (issue)
Type of change
[ ] Bug fix
[ ] Testbed and Framework (new/improvement)
[x] Test case (new/improvement)
Approach
What is the motivation for this PR?
This PR aims to test the feature of container checker and PR link of container checker is sonic-net/sonic-buildimage#6251.
The script container_checker is run periodically by Monit and monitors the running status of each container. Currently the auto-restart feature is enabled: if a critical process exits unexpectedly, its container will be restarted. If a container is restarted 3 times within 20 minutes, it will not run anymore unless we clear the flag manually with the command sudo systemctl reset-failed <container_name>.
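The 3-restarts-in-20-minutes rate limit described above can be expressed as a small predicate. A sketch only: the limit and window come from this description, and the function name is hypothetical.

```python
def hit_restart_limit(restart_times, limit=3, window_seconds=20 * 60):
    """True if `limit` restarts fall inside any `window_seconds` window.

    `restart_times` are restart timestamps in seconds (any epoch).
    """
    times = sorted(restart_times)
    return any(times[i + limit - 1] - times[i] <= window_seconds
               for i in range(len(times) - limit + 1))

# Three restarts within 10 minutes trip the limit and the container stays
# down until `systemctl reset-failed` clears the flag; restarts spread over
# 32 minutes do not.
```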
How did you do it?
This pytest script will test the script container_checker in the following steps:
1. Stop the containers explicitly.
2. Check whether the names of the stopped containers appear in the Monit alerting message.
3. Restart the containers with config_reload(...).
4. Post-check that all critical processes are running and BGP sessions are established.
How did you verify/test it?
I tested the PR against the physical testbed (str-dx010-acs-1), which was installed with an image built from the public master branch.
Any platform specific information?
N/A
Supported testbed topology if it's a new test case?
N/A
Documentation