
[pytest] Test the feature of container checker. #2890

Merged
yozhao101 merged 33 commits into sonic-net:master from yozhao101:test_container_checker on Mar 19, 2021

Conversation

@yozhao101 (Contributor) commented Jan 29, 2021

Signed-off-by: Yong Zhao [email protected]

Description of PR

Summary:
This PR adds a test for the container checker feature; the feature itself was introduced in sonic-net/sonic-buildimage#6251.

Fixes # (issue)

Type of change

- [ ] Bug fix
- [ ] Testbed and Framework (new/improvement)
- [x] Test case (new/improvement)

Approach

What is the motivation for this PR?

This PR adds a test for the container checker feature. The container checker itself was introduced in sonic-net/sonic-buildimage#6251.

The container_checker script is run periodically by Monit to monitor the running status of each container. Currently the auto-restart feature is enabled: if a critical process exits unexpectedly, its container is restarted. If a container is restarted 3 times within 20 minutes, it will not run anymore until the failure flag is cleared manually with the command sudo systemctl reset-failed <container_name>.
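For illustration, here is a minimal sketch of what such a checker does. This is not the real /usr/bin/container_checker from sonic-buildimage, and the expected-container list is hypothetical; Monit turns the non-zero exit status and the printed output into a syslog alert.

#!/usr/bin/env python
# Minimal illustrative sketch (not the real container_checker): compare the
# containers expected to run against `docker ps` output and exit non-zero
# when any expected container is missing.
import subprocess
import sys

# Hypothetical list; the real checker derives it from the device configuration.
EXPECTED_CONTAINERS = ["swss", "syncd", "bgp", "teamd", "pmon", "lldp", "snmp"]

def get_running_containers():
    # `docker ps --format '{{.Names}}'` prints one running container name per line.
    output = subprocess.check_output(["docker", "ps", "--format", "{{.Names}}"])
    return set(output.decode().split())

def main():
    running = get_running_containers()
    not_running = [name for name in EXPECTED_CONTAINERS if name not in running]
    if not_running:
        print("Expected containers not running: " + ", ".join(not_running))
        sys.exit(3)

if __name__ == "__main__":
    main()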

How did you do it?

This pytest script tests the container_checker script in the following steps (a condensed sketch follows the list):

  1. Stop the containers explicitly.
  2. Check whether the names of the stopped containers appear in the Monit alerting message.
  3. Restart the containers via config_reload(...).
  4. Post-check that all critical processes are running and BGP sessions are established.
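The condensed sketch below shows that flow; the function names match the log output in the next section, while the duthost fixture and helper bodies are simplified assumptions (the helpers are defined elsewhere in test_container_checker.py).

import logging
import time

from tests.common.config_reload import config_reload  # helper visible in the log below

logger = logging.getLogger(__name__)

def test_container_checker(duthost):
    # Step 1: stop the containers explicitly.
    stopped_container_list = stop_containers(duthost)

    # Step 2: give Monit time to raise the alert, then check syslog.
    logger.info("Sleep 6 minutes to wait for the alerting message...")
    time.sleep(360)
    check_alerting_message(duthost, stopped_container_list)

    # Step 3: restart the containers via config_reload(...).
    logger.info("Executing the config reload...")
    config_reload(duthost)

    # Step 4: post-check critical processes and BGP sessions.
    postcheck_critical_processes_status(duthost)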

How did you verify/test it?

I tested the PR against the physical testbed (str-dx010-acs-1), which was running an image built from the public master branch.

----------------------------------------------------------------------------------------------- live log call ------------------------------------------------------------------------------------------------
22:41:00 INFO test_container_checker.py:stop_containers:138: Stopping the container 'lldp'...
22:41:12 INFO test_container_checker.py:stop_containers:140: Waiting until container 'lldp' is stopped...
22:41:13 INFO test_container_checker.py:stop_containers:145: Container 'lldp' was stopped
22:41:13 INFO test_container_checker.py:stop_containers:138: Stopping the container 'bgp'...
22:41:18 INFO test_container_checker.py:stop_containers:140: Waiting until container 'bgp' is stopped...
22:41:19 INFO test_container_checker.py:stop_containers:145: Container 'bgp' was stopped
22:41:19 INFO test_container_checker.py:stop_containers:138: Stopping the container 'pmon'...
22:41:24 INFO test_container_checker.py:stop_containers:140: Waiting until container 'pmon' is stopped...
22:41:24 INFO test_container_checker.py:stop_containers:145: Container 'pmon' was stopped
22:41:24 INFO test_container_checker.py:stop_containers:138: Stopping the container 'telemetry'...
22:41:27 INFO test_container_checker.py:stop_containers:140: Waiting until container 'telemetry' is stopped...
22:41:28 INFO test_container_checker.py:stop_containers:145: Container 'telemetry' was stopped
22:41:28 INFO test_container_checker.py:stop_containers:138: Stopping the container 'snmp'...
22:41:35 INFO test_container_checker.py:stop_containers:140: Waiting until container 'snmp' is stopped...
22:41:36 INFO test_container_checker.py:stop_containers:145: Container 'snmp' was stopped
22:41:36 INFO test_container_checker.py:stop_containers:138: Stopping the container 'dhcp_relay'...
22:41:39 INFO test_container_checker.py:stop_containers:140: Waiting until container 'dhcp_relay' is stopped...
22:41:40 INFO test_container_checker.py:stop_containers:145: Container 'dhcp_relay' was stopped
22:41:40 INFO test_container_checker.py:stop_containers:138: Stopping the container 'mgmt-framework'...
22:41:43 INFO test_container_checker.py:stop_containers:140: Waiting until container 'mgmt-framework' is stopped...
22:41:43 INFO test_container_checker.py:stop_containers:145: Container 'mgmt-framework' was stopped
22:41:43 INFO test_container_checker.py:stop_containers:138: Stopping the container 'teamd'...
22:42:01 INFO test_container_checker.py:stop_containers:140: Waiting until container 'teamd' is stopped...
22:42:01 INFO test_container_checker.py:stop_containers:145: Container 'teamd' was stopped
22:42:01 INFO test_container_checker.py:stop_containers:138: Stopping the container 'syncd'...
22:42:16 INFO test_container_checker.py:stop_containers:140: Waiting until container 'syncd' is stopped...
22:42:17 INFO test_container_checker.py:stop_containers:145: Container 'syncd' was stopped
22:42:17 INFO test_container_checker.py:stop_containers:138: Stopping the container 'swss'...
22:42:19 INFO test_container_checker.py:stop_containers:140: Waiting until container 'swss' is stopped...
22:42:20 INFO test_container_checker.py:stop_containers:145: Container 'swss' was stopped
22:42:20 INFO test_container_checker.py:test_container_checker:214: Sleep 6 minutes to wait for the alerting message...
22:48:20 INFO test_container_checker.py:check_alerting_message:161: Checking the alerting message...
22:48:21 INFO test_container_checker.py:check_alerting_message:181: Checking the alerting message was done!
22:48:21 INFO test_container_checker.py:test_container_checker:219: Executing the config reload...
22:48:21 INFO config_reload.py:config_reload:24: reloading config_db
22:50:46 INFO test_container_checker.py:test_container_checker:221: Executing the config reload was done!
22:50:46 INFO test_container_checker.py:postcheck_critical_processes_status:116: Post-checking status of critical processes and BGP sessions...
22:50:53 INFO devices.py:critical_process_status:509: ====== supervisor process status for service pmon ======
22:50:55 INFO devices.py:critical_process_status:509: ====== supervisor process status for service snmp ======
22:50:58 INFO devices.py:critical_process_status:509: ====== supervisor process status for service lldp ======
22:51:00 INFO devices.py:critical_process_status:509: ====== supervisor process status for service database ======
22:51:03 INFO devices.py:critical_process_status:509: ====== supervisor process status for service bgp ======
22:51:05 INFO devices.py:critical_process_status:509: ====== supervisor process status for service database ======
22:51:07 INFO devices.py:critical_process_status:509: ====== supervisor process status for service lldp ======
22:51:10 INFO devices.py:critical_process_status:509: ====== supervisor process status for service swss ======
22:51:12 INFO devices.py:critical_process_status:509: ====== supervisor process status for service syncd ======
22:51:15 INFO devices.py:critical_process_status:509: ====== supervisor process status for service teamd ======
PASSED                                                                                                                                                                                                 
[100%]
--------------------------------------------------------------------------------------------- live log teardown ----------------------------------------------------------------------------------------------
22:51:16 INFO config_reload.py:config_reload:24: reloading config_db
22:54:50 INFO __init__.py:sanity_check:139: Start post-test sanity check
22:54:50 INFO __init__.py:sanity_check:142: No post-test check is required. Done post-test sanity check
1 passed, 2 warnings in 901.11 seconds 

Any platform specific information?

N/A

Supported testbed topology if it's a new test case?

N/A

Documentation

@lguohan (Contributor) commented Jan 31, 2021

/Azurepipelines run

@azure-pipelines

Pull request contains merge conflicts.

@yozhao101 (Contributor Author)

Retest vsimage please.

@lguohan (Contributor) commented Jan 31, 2021

The test costs 15 minutes; is there a way to reduce the test time?

@lguohan (Contributor) commented Jan 31, 2021

/Azurepipelines run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).


# Wait for 6 minutes such that Monit has a chance to write an alerting message into syslog.
logger.info("Sleep 6 minutes to wait for the alerting message...")
time.sleep(360)
Contributor: This seems very long? Why do we need to wait for 6 minutes?

@yozhao101 (Contributor Author): The container_checker script is run periodically by Monit, and if one of the containers is not running, Monit writes an alerting message into syslog after 5 minutes. So this pytest script has to wait 6 minutes and then check whether the alerting message appeared in syslog.

Contributor: Can you reduce the Monit duration?

@yozhao101 (Contributor Author): Currently the Monit configuration of the container_checker script in the image is as follows:

check program container_checker with path "/usr/bin/container_checker"
    if status != 0 for 5 times within 5 cycles then alert repeat every 1 cycles

Can we change the configuration like this? Then I can reduce the waiting time here.

check program container_checker with path "/usr/bin/container_checker"
    if status != 0 for 2 times within 2 cycles then alert repeat every 1 cycles

Contributor: Yes, change to 2 cycles in the test and change it back after the test finishes.
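A sketch of how that reduce-and-restore can look as a fixture, along the lines of the update_monit_service(...) function mentioned later in this thread; the backup path and the systemctl restart are assumptions, and the sed expression is the one that appears later in this thread.

import pytest

MONIT_CONF = "/etc/monit/conf.d/sonic-host"
MONIT_CONF_BACKUP = "/tmp/sonic-host.bak"  # hypothetical backup location

@pytest.fixture
def update_monit_service(duthost):
    # Back up the original Monit configuration.
    duthost.shell("sudo cp {} {}".format(MONIT_CONF, MONIT_CONF_BACKUP))
    # Tighten the alert condition on the last line of the config file.
    duthost.shell(r"sudo sed -i '$s/[2-9]\|[1-9][0-9]\+/2/g' " + MONIT_CONF)
    duthost.shell("sudo systemctl restart monit")
    yield
    # Restore the original configuration; a later commit switches this to `mv`.
    duthost.shell("sudo mv {} {}".format(MONIT_CONF_BACKUP, MONIT_CONF))
    duthost.shell("sudo systemctl restart monit")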

@yozhao101 yozhao101 marked this pull request as ready for review January 31, 2021 14:23
check_alerting_message(duthost, stopped_container_list)

logger.info("Executing the config reload...")
config_reload(duthost)
@yozhao101 (Contributor Author): @bingwang-ms @wangxin Please help me review this PR again. The previous PR (#2852) was reverted because BGP sessions were down after all containers were restarted, which failed the post-check on the KVM testbed. The root cause is that the containers should not be restarted in the same sequence in which they were stopped. The correct way is to use config_reload(...) to restart the containers, since some containers depend on others. After running config_reload(...), line 255 checks whether all stopped containers are running and restarts any containers that config_reload(...) did not restart, such as the mgmt-framework container.
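A hypothetical sketch of that post-reload check; the helper name follows the comment above, and the docker/systemctl invocations are assumptions:

def check_containers_status(duthost, stopped_container_list):
    # Names of the containers currently running on the DUT.
    running = set(duthost.shell("docker ps --format '{{.Names}}'")["stdout_lines"])
    for container in stopped_container_list:
        if container not in running:
            # Some containers, e.g. mgmt-framework, are not restarted by
            # config_reload(...) and have to be started explicitly.
            duthost.shell("sudo systemctl start {}".format(container))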

@bingwang-ms (Collaborator) commented Feb 2, 2021: LGTM, except for the time.sleep(360) that guohan has mentioned. Please update the Monit config for container_checker and shorten the sleep time. Thanks.

@yozhao101 (Contributor Author): Thanks, Bing! Please help me review the function update_monit_service(...) and line 274.

Collaborator: Why do we have to do config_reload here? Is config_reload_after_tests not enough?

@yozhao101 (Contributor Author): This script first does the config_reload(...) to start the stopped containers and then does the post-check to verify that BGP sessions are established and all critical processes are running.

Collaborator: Why not move config_reload and the checking below to config_reload_after_tests?

@yozhao101 (Contributor Author) commented Mar 4, 2021: @yxieca This pytest script skips DuTs running the 201911 image. If this pytest script were run against a DuT with a public image, and I removed the config_reload(...) here and only invoked config_reload(...) in the fixture config_reload_after_test(...), then the mgmt-framework container would not be restarted. That is why this script first invokes config_reload(...) and then uses the function check_containers_status(...) to restart the containers that config_reload(...) cannot restart.

None.
"""
logger.info("Reduing the monitoring interval of container_checker.")
duthost.shell("sudo sed -i '$s/5/2/g' /etc/monit/conf.d/sonic-host")
Contributor: This code is fragile; if the SONiC image changes the interval to 6 minutes, it will break here.

@yozhao101 (Contributor Author): Good catch!

None.
"""
logger.info("Reduing the monitoring interval of container_checker.")
duthost.shell("sudo sed -i '$s/[2-9]\|[1-9][0-9]\+/2/g' /etc/monit/conf.d/sonic-host")
@yozhao101 (Contributor Author): @lguohan Please help me review again!

return stopped_container_list


def check_alerting_message(duthost, stopped_container_list):
Contributor: Yes, use the log analyzer.

@yozhao101 (Contributor Author): Updated!
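For reference, a sketch of what the LogAnalyzer-based check can look like; the module path follows common sonic-mgmt usage, and the expected regex is an assumption about the alert text:

from tests.common.plugins.loganalyzer.loganalyzer import LogAnalyzer

def check_alerting_message(duthost, stopped_container_list):
    loganalyzer = LogAnalyzer(ansible_host=duthost, marker_prefix="container_checker")
    # Expect one Monit alerting line per stopped container (assumed pattern).
    loganalyzer.expect_regex = [
        ".*Expected containers not running.*{}.*".format(name)
        for name in stopped_container_list
    ]
    marker = loganalyzer.init()   # drop a start marker into syslog
    # ... wait for Monit to run container_checker and raise the alert ...
    loganalyzer.analyze(marker)   # fails the test if any expected regex is missing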

files. Use `mv` instead of `cp -f` + `rm -f` to restore Monit configuration files.

Signed-off-by: Yong Zhao <[email protected]>
@yxieca (Collaborator) commented Mar 11, 2021

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

yozhao101 added a commit that referenced this pull request Mar 18, 2021

…ck. (#3171)

Signed-off-by: Yong Zhao <[email protected]>

Summary:
This PR aims to increase the maximum value of the Monit stable time in the sanity check.

Fixes # (issue)

Type of change

- [x] Bug fix
- [ ] Testbed and Framework (new/improvement)
- [ ] Test case (new/improvement)

Approach

What is the motivation for this PR?

When this PR (#2890) was run against the virtual testbed, it restarted the Monit service in a fixture after testing. This caused the sanity check of the next test to fail because Monit did not have enough time to initialize the states of services.

How did you do it?

I increased the maximum value of the Monit stable time in the sanity check.

How did you verify/test it?

I verified this on the virtual testbed by running the script kvmtest.sh to make sure the pytest script can pass the test.

Any platform specific information?

N/A
@yozhao101 (Contributor Author)

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@yozhao101 yozhao101 merged commit b96a54c into sonic-net:master Mar 19, 2021
@yozhao101 yozhao101 deleted the test_container_checker branch March 19, 2021 23:28
vmittal-msft pushed a commit to vmittal-msft/sonic-mgmt that referenced this pull request Sep 28, 2021

…ck. (sonic-net#3171)
vmittal-msft pushed a commit to vmittal-msft/sonic-mgmt that referenced this pull request Sep 28, 2021