
[pytest] Test the feature of container checker. #2890

Merged
yozhao101 merged 33 commits into sonic-net:master from yozhao101:test_container_checker on Mar 19, 2021

Conversation

@yozhao101 (Contributor) commented Jan 29, 2021

Signed-off-by: Yong Zhao [email protected]

Description of PR

Summary:
This PR adds a test for the container checker feature; the feature itself was introduced in sonic-net/sonic-buildimage#6251.

Fixes # (issue)

Type of change

- [ ] Bug fix
- [ ] Testbed and Framework (new/improvement)
- [x] Test case (new/improvement)

Approach

What is the motivation for this PR?

This PR adds a test for the container checker feature. The container checker itself was introduced in sonic-net/sonic-buildimage#6251.

The container_checker script is run periodically by Monit to monitor the running status of each container. Currently the auto-restart feature is enabled: if a critical process exits unexpectedly, its container is restarted. If a container is restarted 3 times within 20 minutes, it will not run anymore until the failure flag is cleared manually with the command sudo systemctl reset-failed <container_name>.
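For illustration, here is a minimal sketch of what such a checker does. This is not the real /usr/bin/container_checker from sonic-buildimage, and the expected-container list is hypothetical; Monit turns the non-zero exit status and the printed output into a syslog alert.

#!/usr/bin/env python
# Minimal illustrative sketch (not the real container_checker): compare the
# containers expected to run against `docker ps` output and exit non-zero
# when any expected container is missing.
import subprocess
import sys

# Hypothetical list; the real checker derives it from the device configuration.
EXPECTED_CONTAINERS = ["swss", "syncd", "bgp", "teamd", "pmon", "lldp", "snmp"]

def get_running_containers():
    # `docker ps --format '{{.Names}}'` prints one running container name per line.
    output = subprocess.check_output(["docker", "ps", "--format", "{{.Names}}"])
    return set(output.decode().split())

def main():
    running = get_running_containers()
    not_running = [name for name in EXPECTED_CONTAINERS if name not in running]
    if not_running:
        print("Expected containers not running: " + ", ".join(not_running))
        sys.exit(3)

if __name__ == "__main__":
    main()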

How did you do it?

This pytest script tests the container_checker script in the following steps (a condensed sketch follows the list):

  1. Stop the containers explicitly.
  2. Check whether the names of the stopped containers appear in the Monit alerting message.
  3. Restart the containers via config_reload(...).
  4. Post-check that all critical processes are running and BGP sessions are established.
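The condensed sketch below shows that flow; the function names match the log output in the next section, while the duthost fixture and helper bodies are simplified assumptions (the helpers are defined elsewhere in test_container_checker.py).

import logging
import time

from tests.common.config_reload import config_reload  # helper visible in the log below

logger = logging.getLogger(__name__)

def test_container_checker(duthost):
    # Step 1: stop the containers explicitly.
    stopped_container_list = stop_containers(duthost)

    # Step 2: give Monit time to raise the alert, then check syslog.
    logger.info("Sleep 6 minutes to wait for the alerting message...")
    time.sleep(360)
    check_alerting_message(duthost, stopped_container_list)

    # Step 3: restart the containers via config_reload(...).
    logger.info("Executing the config reload...")
    config_reload(duthost)

    # Step 4: post-check critical processes and BGP sessions.
    postcheck_critical_processes_status(duthost)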

How did you verify/test it?

I tested the PR against the physical testbed (str-dx010-acs-1), which was running an image built from the public master branch.

----------------------------------------------------------------------------------------------- live log call ------------------------------------------------------------------------------------------------
22:41:00 INFO test_container_checker.py:stop_containers:138: Stopping the container 'lldp'...
22:41:12 INFO test_container_checker.py:stop_containers:140: Waiting until container 'lldp' is stopped...
22:41:13 INFO test_container_checker.py:stop_containers:145: Container 'lldp' was stopped
22:41:13 INFO test_container_checker.py:stop_containers:138: Stopping the container 'bgp'...
22:41:18 INFO test_container_checker.py:stop_containers:140: Waiting until container 'bgp' is stopped...
22:41:19 INFO test_container_checker.py:stop_containers:145: Container 'bgp' was stopped
22:41:19 INFO test_container_checker.py:stop_containers:138: Stopping the container 'pmon'...
22:41:24 INFO test_container_checker.py:stop_containers:140: Waiting until container 'pmon' is stopped...
22:41:24 INFO test_container_checker.py:stop_containers:145: Container 'pmon' was stopped
22:41:24 INFO test_container_checker.py:stop_containers:138: Stopping the container 'telemetry'...
22:41:27 INFO test_container_checker.py:stop_containers:140: Waiting until container 'telemetry' is stopped...
22:41:28 INFO test_container_checker.py:stop_containers:145: Container 'telemetry' was stopped
22:41:28 INFO test_container_checker.py:stop_containers:138: Stopping the container 'snmp'...
22:41:35 INFO test_container_checker.py:stop_containers:140: Waiting until container 'snmp' is stopped...
22:41:36 INFO test_container_checker.py:stop_containers:145: Container 'snmp' was stopped
22:41:36 INFO test_container_checker.py:stop_containers:138: Stopping the container 'dhcp_relay'...
22:41:39 INFO test_container_checker.py:stop_containers:140: Waiting until container 'dhcp_relay' is stopped...
22:41:40 INFO test_container_checker.py:stop_containers:145: Container 'dhcp_relay' was stopped
22:41:40 INFO test_container_checker.py:stop_containers:138: Stopping the container 'mgmt-framework'...
22:41:43 INFO test_container_checker.py:stop_containers:140: Waiting until container 'mgmt-framework' is stopped...
22:41:43 INFO test_container_checker.py:stop_containers:145: Container 'mgmt-framework' was stopped
22:41:43 INFO test_container_checker.py:stop_containers:138: Stopping the container 'teamd'...
22:42:01 INFO test_container_checker.py:stop_containers:140: Waiting until container 'teamd' is stopped...
22:42:01 INFO test_container_checker.py:stop_containers:145: Container 'teamd' was stopped
22:42:01 INFO test_container_checker.py:stop_containers:138: Stopping the container 'syncd'...
22:42:16 INFO test_container_checker.py:stop_containers:140: Waiting until container 'syncd' is stopped...
22:42:17 INFO test_container_checker.py:stop_containers:145: Container 'syncd' was stopped
22:42:17 INFO test_container_checker.py:stop_containers:138: Stopping the container 'swss'...
22:42:19 INFO test_container_checker.py:stop_containers:140: Waiting until container 'swss' is stopped...
22:42:20 INFO test_container_checker.py:stop_containers:145: Container 'swss' was stopped
22:42:20 INFO test_container_checker.py:test_container_checker:214: Sleep 6 minutes to wait for the alerting message...
22:48:20 INFO test_container_checker.py:check_alerting_message:161: Checking the alerting message...
22:48:21 INFO test_container_checker.py:check_alerting_message:181: Checking the alerting message was done!
22:48:21 INFO test_container_checker.py:test_container_checker:219: Executing the config reload...
22:48:21 INFO config_reload.py:config_reload:24: reloading config_db
22:50:46 INFO test_container_checker.py:test_container_checker:221: Executing the config reload was done!
22:50:46 INFO test_container_checker.py:postcheck_critical_processes_status:116: Post-checking status of critical processes and BGP sessions...
22:50:53 INFO devices.py:critical_process_status:509: ====== supervisor process status for service pmon ======
22:50:55 INFO devices.py:critical_process_status:509: ====== supervisor process status for service snmp ======
22:50:58 INFO devices.py:critical_process_status:509: ====== supervisor process status for service lldp ======
22:51:00 INFO devices.py:critical_process_status:509: ====== supervisor process status for service database ======
22:51:03 INFO devices.py:critical_process_status:509: ====== supervisor process status for service bgp ======
22:51:05 INFO devices.py:critical_process_status:509: ====== supervisor process status for service database ======
22:51:07 INFO devices.py:critical_process_status:509: ====== supervisor process status for service lldp ======
22:51:10 INFO devices.py:critical_process_status:509: ====== supervisor process status for service swss ======
22:51:12 INFO devices.py:critical_process_status:509: ====== supervisor process status for service syncd ======
22:51:15 INFO devices.py:critical_process_status:509: ====== supervisor process status for service teamd ======
PASSED                                                                                                                                                                                                 
[100%]
--------------------------------------------------------------------------------------------- live log teardown ----------------------------------------------------------------------------------------------
22:51:16 INFO config_reload.py:config_reload:24: reloading config_db
22:54:50 INFO __init__.py:sanity_check:139: Start post-test sanity check
22:54:50 INFO __init__.py:sanity_check:142: No post-test check is required. Done post-test sanity check
1 passed, 2 warnings in 901.11 seconds 

Any platform specific information?

N/A

Supported testbed topology if it's a new test case?

N/A

Documentation

@lguohan (Contributor) commented Jan 31, 2021

/Azurepipelines run

@azure-pipelines

Pull request contains merge conflicts.

@yozhao101 (Contributor Author)

Retest vsimage please.

@lguohan (Contributor) commented Jan 31, 2021

The test costs 15 minutes; is there a way to reduce the test time?

@lguohan (Contributor) commented Jan 31, 2021

/Azurepipelines run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).


# Wait for 6 minutes such that Monit has a chance to write an alerting message into syslog.
logger.info("Sleep 6 minutes to wait for the alerting message...")
time.sleep(360)
Contributor: This seems very long? Why do we need to wait for 6 minutes?

@yozhao101 (Contributor Author): The container_checker script is run periodically by Monit, and if one of the containers is not running, Monit writes an alerting message into syslog after 5 minutes. So this pytest script has to wait 6 minutes and then check whether the alerting message appeared in syslog.

Contributor: Can you reduce the Monit duration?

@yozhao101 (Contributor Author): Currently the Monit configuration of the container_checker script in the image is as follows:

check program container_checker with path "/usr/bin/container_checker"
    if status != 0 for 5 times within 5 cycles then alert repeat every 1 cycles

Can we change the configuration like this? Then I can reduce the waiting time here.

check program container_checker with path "/usr/bin/container_checker"
    if status != 0 for 2 times within 2 cycles then alert repeat every 1 cycles

Contributor: Yes, change to 2 cycles in the test and change it back after the test finishes.
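A sketch of how that reduce-and-restore can look as a fixture, along the lines of the update_monit_service(...) function mentioned later in this thread; the backup path and the systemctl restart are assumptions, and the sed expression is the one that appears later in this thread.

import pytest

MONIT_CONF = "/etc/monit/conf.d/sonic-host"
MONIT_CONF_BACKUP = "/tmp/sonic-host.bak"  # hypothetical backup location

@pytest.fixture
def update_monit_service(duthost):
    # Back up the original Monit configuration.
    duthost.shell("sudo cp {} {}".format(MONIT_CONF, MONIT_CONF_BACKUP))
    # Tighten the alert condition on the last line of the config file.
    duthost.shell(r"sudo sed -i '$s/[2-9]\|[1-9][0-9]\+/2/g' " + MONIT_CONF)
    duthost.shell("sudo systemctl restart monit")
    yield
    # Restore the original configuration; a later commit switches this to `mv`.
    duthost.shell("sudo mv {} {}".format(MONIT_CONF_BACKUP, MONIT_CONF))
    duthost.shell("sudo systemctl restart monit")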

@yozhao101 yozhao101 marked this pull request as ready for review January 31, 2021 14:23
check_alerting_message(duthost, stopped_container_list)

logger.info("Executing the config reload...")
config_reload(duthost)
@yozhao101 (Contributor Author): @bingwang-ms @wangxin Please help me review this PR again. The previous PR (#2852) was reverted because BGP sessions were down after all containers were restarted, which failed the post-check on the KVM testbed. The root cause is that the containers should not be restarted in the same sequence in which they were stopped. The correct way is to use config_reload(...) to restart the containers, since some containers depend on others. After running config_reload(...), line 255 checks whether all stopped containers are running and restarts any containers that config_reload(...) did not restart, such as the mgmt-framework container.
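A hypothetical sketch of that post-reload check; the helper name follows the comment above, and the docker/systemctl invocations are assumptions:

def check_containers_status(duthost, stopped_container_list):
    # Names of the containers currently running on the DUT.
    running = set(duthost.shell("docker ps --format '{{.Names}}'")["stdout_lines"])
    for container in stopped_container_list:
        if container not in running:
            # Some containers, e.g. mgmt-framework, are not restarted by
            # config_reload(...) and have to be started explicitly.
            duthost.shell("sudo systemctl start {}".format(container))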

@bingwang-ms (Collaborator) commented Feb 2, 2021: LGTM, except for the time.sleep(360) that guohan has mentioned. Please update the Monit config for container_checker and shorten the sleep time. Thanks.

@yozhao101 (Contributor Author): Thanks, Bing! Please help me review the function update_monit_service(...) and line 274.

Collaborator: Why do we have to do config_reload here? Is config_reload_after_tests not enough?

@yozhao101 (Contributor Author): This script first does the config_reload(...) to start the stopped containers and then does the post-check to verify that BGP sessions are established and all critical processes are running.

Collaborator: Why not move config_reload and the checking below to config_reload_after_tests?

@yozhao101 (Contributor Author) commented Mar 4, 2021: @yxieca This pytest script skips DuTs running the 201911 image. If this pytest script were run against a DuT with a public image, and I removed the config_reload(...) here and only invoked config_reload(...) in the fixture config_reload_after_test(...), then the mgmt-framework container would not be restarted. That is why this script first invokes config_reload(...) and then uses the function check_containers_status(...) to restart the containers that config_reload(...) cannot restart.

None.
"""
logger.info("Reduing the monitoring interval of container_checker.")
duthost.shell("sudo sed -i '$s/5/2/g' /etc/monit/conf.d/sonic-host")
Contributor: This code is fragile; if the SONiC image changes the interval to 6 minutes, it will break here.

@yozhao101 (Contributor Author): Good catch!

None.
"""
logger.info("Reduing the monitoring interval of container_checker.")
duthost.shell("sudo sed -i '$s/[2-9]\|[1-9][0-9]\+/2/g' /etc/monit/conf.d/sonic-host")
@yozhao101 (Contributor Author): @lguohan Please help me review again!

return stopped_container_list


def check_alerting_message(duthost, stopped_container_list):
Contributor: Yes, use the log analyzer.

@yozhao101 (Contributor Author): Updated!
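For reference, a sketch of what the LogAnalyzer-based check can look like; the module path follows common sonic-mgmt usage, and the expected regex is an assumption about the alert text:

from tests.common.plugins.loganalyzer.loganalyzer import LogAnalyzer

def check_alerting_message(duthost, stopped_container_list):
    loganalyzer = LogAnalyzer(ansible_host=duthost, marker_prefix="container_checker")
    # Expect one Monit alerting line per stopped container (assumed pattern).
    loganalyzer.expect_regex = [
        ".*Expected containers not running.*{}.*".format(name)
        for name in stopped_container_list
    ]
    marker = loganalyzer.init()   # drop a start marker into syslog
    # ... wait for Monit to run container_checker and raise the alert ...
    loganalyzer.analyze(marker)   # fails the test if any expected regex is missing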

files. Use `mv` instead of `cp -f` + `rm -f` to restore Monit configuration files.

Signed-off-by: Yong Zhao <[email protected]>
@yxieca (Collaborator) commented Mar 11, 2021

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

yozhao101 added a commit that referenced this pull request Mar 18, 2021

…ck. (#3171)

Signed-off-by: Yong Zhao <[email protected]>

Summary:
This PR aims to increase the maximum value of the Monit stable time in the sanity check.

Fixes # (issue)

Type of change

- [x] Bug fix
- [ ] Testbed and Framework (new/improvement)
- [ ] Test case (new/improvement)

Approach

What is the motivation for this PR?

When this PR (#2890) was run against the virtual testbed, it restarted the Monit service in a fixture after testing. This caused the sanity check of the next test to fail because Monit did not have enough time to initialize the states of services.

How did you do it?

I increased the maximum value of the Monit stable time in the sanity check.

How did you verify/test it?

I verified this on the virtual testbed by running the script kvmtest.sh to make sure the pytest script can pass the test.

Any platform specific information?

N/A
@yozhao101 (Contributor Author)

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@yozhao101 yozhao101 merged commit b96a54c into sonic-net:master Mar 19, 2021
@yozhao101 yozhao101 deleted the test_container_checker branch March 19, 2021 23:28
vmittal-msft pushed a commit to vmittal-msft/sonic-mgmt that referenced this pull request Sep 28, 2021

…ck. (sonic-net#3171)
vmittal-msft pushed a commit to vmittal-msft/sonic-mgmt that referenced this pull request Sep 28, 2021