[pytest] Test the feature of container checker. #2890

Merged
merged 33 commits into from
Mar 19, 2021
Changes from 30 commits
2453fb3
[pytest] Test the feature of container checker.
yozhao101 Jan 29, 2021
b11e303
[pytest] Add some log information.
yozhao101 Jan 29, 2021
28e9da5
[pytest] Add the log information.
yozhao101 Jan 29, 2021
eab6ece
Merge branch 'master' into test_container_checker
yozhao101 Jan 29, 2021
6dde433
[pytest] Delete extra blank lines.
yozhao101 Jan 29, 2021
91fcad3
[pytest] Delete extra whitespace.
yozhao101 Jan 29, 2021
1b9eae5
[pytest] Change the file mode to 0755.
yozhao101 Jan 29, 2021
0f86170
[pytest] Manually restart the mgmt-framework container.
yozhao101 Jan 30, 2021
e4ae303
[pytest] Check whether the stopped containers were restarted or not.
yozhao101 Jan 31, 2021
7e0b948
Merge branch 'master' into test_container_checker
yozhao101 Jan 31, 2021
622cace
[pytest] Change the file kvmtest.h.
yozhao101 Jan 31, 2021
206bc63
[pytest] Remove the extra spaces.
yozhao101 Jan 31, 2021
043aa71
[pytest] Remove trailing spaces.
yozhao101 Jan 31, 2021
619fe8f
[pytest] Reorganize the comments.
yozhao101 Jan 31, 2021
f52b2e6
[pytest] Update the Monit configuration.
yozhao101 Feb 2, 2021
c25f23a
[pytest] Use regex in sed command to find a number which is great tha…
yozhao101 Feb 2, 2021
b02cdca
[pytest] Fix a typo.
yozhao101 Feb 3, 2021
ce74aa2
[pytest] Fix the issue related to updating the Monit config of contai…
yozhao101 Feb 5, 2021
5c2ac25
[pytest] Reorganize the comments.
yozhao101 Feb 5, 2021
9fe97e3
[pytest] Fix the syntax error of Monit config entry.
yozhao101 Feb 5, 2021
dc43951
[pytest] Reorganize the comments.
yozhao101 Feb 10, 2021
d8b2d7e
[pytest] Use the logAnalyzer to analyze the alerting message.
yozhao101 Feb 12, 2021
7280c23
[pytest] Did not load the command regex.
yozhao101 Feb 14, 2021
b04c5a6
[pytest] Use the `append(...)` instead of `extend(...)`.
yozhao101 Feb 18, 2021
db478d4
[pytest] Stupid change and should use `extend(...)`.
yozhao101 Feb 18, 2021
c398d29
[pytest] Use `pytest_require(...)` to skip the DUTs which was installed
yozhao101 Feb 18, 2021
669b590
[pytest] Backup Monit configuration files in the /tmp/ directory and
yozhao101 Feb 24, 2021
7b4b6b1
[pytest] Remove extra periods.
yozhao101 Feb 24, 2021
6a6696c
[pytest] Use `cp -f` instead of `cp` to backup the Monit configuration
yozhao101 Feb 24, 2021
1dab708
[pytest] Add `-f` option for the `mv` command and do not prompt before
yozhao101 Feb 24, 2021
47eecca
Merge branch 'master' into test_container_checker
yozhao101 Mar 1, 2021
23d844b
[pytest] Invoked the `config_reload(...)` after testing and then do
yozhao101 Mar 8, 2021
e8131dc
[pytest] Skip the gbsyncd container.
yozhao101 Mar 12, 2021
61 changes: 61 additions & 0 deletions tests/common/helpers/dut_utils.py
@@ -1,4 +1,9 @@
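import logging

from tests.common.helpers.assertions import pytest_assert
from tests.common.utilities import get_host_visible_vars
from tests.common.utilities import wait_until

# Needed by clear_failed_flag_and_restart(...), which logs through `logger`.
logger = logging.getLogger(__name__)

CONTAINER_CHECK_INTERVAL_SECS = 1
CONTAINER_RESTART_THRESHOLD_SECS = 180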


def is_supervisor_node(inv_files, hostname):
@@ -27,3 +32,59 @@ def is_frontend_node(inv_files, hostname):
    node. If we add more types of nodes, then we need to exclude them from this method as well.
    """
    return not is_supervisor_node(inv_files, hostname)


def is_container_running(duthost, container_name):
    """Decides whether the container is running or not.
    @param duthost: Host DUT.
    @param container_name: Name of a container.
    Returns:
        Boolean value. True means the container is running.
    """
    result = duthost.shell("docker inspect -f \{{\{{.State.Running\}}\}} {}".format(container_name))
    return result["stdout_lines"][0].strip() == "true"


def check_container_state(duthost, container_name, should_be_running):
    """Determines whether a container is in the expected state (running/not running).
    @param duthost: Host DUT.
    @param container_name: Name of container.
    @param should_be_running: Boolean value.
    Returns:
        True if the container is in the expected state. Otherwise False.
    """
    is_running = is_container_running(duthost, container_name)
    return is_running == should_be_running


def is_hitting_start_limit(duthost, container_name):
    """Checks whether the container cannot be restarted because of start-limit-hit.
    @param duthost: Host DUT.
    @param container_name: Name of a container.
    Returns:
        True if the start limit was hit. Otherwise False.
    """
    service_status = duthost.shell("sudo systemctl status {}.service | grep 'Active'".format(container_name))
    for line in service_status["stdout_lines"]:
        if "start-limit-hit" in line:
            return True

    return False
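
For readers unfamiliar with the check above, here is a minimal sketch of the same string test run against sample text instead of a live DUT. The helper name `active_line_hits_start_limit` and both sample lines are illustrative assumptions, not part of this PR.

```python
def active_line_hits_start_limit(status_lines):
    """Return True if any line reports the systemd result 'start-limit-hit'."""
    return any("start-limit-hit" in line for line in status_lines)


# Illustrative lines in the style of `systemctl status <unit> | grep 'Active'`.
failed_sample = ["   Active: failed (Result: start-limit-hit) since Mon 2021-02-01 10:00:00 UTC; 5s ago"]
running_sample = ["   Active: active (running) since Mon 2021-02-01 10:00:00 UTC; 1min ago"]

print(active_line_hits_start_limit(failed_sample))
print(active_line_hits_start_limit(running_sample))
```

Keeping the string test separate from the shell call makes it possible to exercise the logic without a testbed.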


def clear_failed_flag_and_restart(duthost, container_name):
    """Clears the failed flag of a container and restarts it.
    @param duthost: Host DUT.
    @param container_name: Name of a container.
    Returns:
        None
    """
    logger.info("{} hit the start limit; clearing the failed flag".format(container_name))
    duthost.shell("sudo systemctl reset-failed {}.service".format(container_name))
    duthost.shell("sudo systemctl start {}.service".format(container_name))
    restarted = wait_until(CONTAINER_RESTART_THRESHOLD_SECS,
                           CONTAINER_CHECK_INTERVAL_SECS,
                           check_container_state, duthost, container_name, True)
    pytest_assert(restarted,
                  "Failed to restart container '{}' after reset-failed was cleared".format(container_name))
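
The helpers above rely on `wait_until(timeout, interval, condition, *args)` to poll a condition. As a rough sketch of that polling pattern, the stand-in below shows how a timeout, an interval, and a predicate fit together. `poll_until` is a hypothetical name and is not the implementation in `tests.common.utilities`.

```python
import time


def poll_until(timeout, interval, condition, *args):
    """Call condition(*args) every `interval` seconds until it returns True
    or `timeout` seconds have elapsed. Returns True on success, else False."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition(*args):
            return True
        time.sleep(interval)
    return False


# A fake condition that becomes true on the third call.
states = iter([False, False, True])
print(poll_until(1, 0.01, lambda: next(states)))
```

With `CONTAINER_CHECK_INTERVAL_SECS = 1` and `CONTAINER_RESTART_THRESHOLD_SECS = 180`, this pattern would check roughly once per second for up to three minutes.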
289 changes: 289 additions & 0 deletions tests/container_checker/test_container_checker.py
@@ -0,0 +1,289 @@
"""
Test the feature of container_checker
"""
import logging
import time

import pytest

from pkg_resources import parse_version
from tests.common import config_reload
from tests.common.helpers.assertions import pytest_assert
from tests.common.helpers.assertions import pytest_require
from tests.common.helpers.dut_utils import check_container_state
from tests.common.helpers.dut_utils import clear_failed_flag_and_restart
from tests.common.helpers.dut_utils import is_hitting_start_limit
from tests.common.helpers.dut_utils import is_container_running
from tests.common.plugins.loganalyzer.loganalyzer import LogAnalyzer, LogAnalyzerError
from tests.common.utilities import wait_until

logger = logging.getLogger(__name__)

pytestmark = [
    pytest.mark.topology('any'),
    pytest.mark.disable_loganalyzer
]

CONTAINER_CHECK_INTERVAL_SECS = 1
CONTAINER_STOP_THRESHOLD_SECS = 30
CONTAINER_RESTART_THRESHOLD_SECS = 180


@pytest.fixture(autouse=True, scope='module')
def config_reload_after_tests(duthost):
    yield
    config_reload(duthost)


@pytest.fixture(autouse=True, scope="module")
def check_image_version(duthost):
    """Skips this test if the SONiC image installed on the DUT is 201911 or older.

    Args:
        duthost: Host DUT.

    Return:
        None.
    """
    pytest_require(parse_version(duthost.kernel_version) > parse_version("4.9.0"),
                   "Test is not supported on 201911 and older image versions!")


@pytest.fixture(autouse=True, scope="module")
def update_monit_service(duthost):
    """Update Monit configuration and restart it.

    This function will first reduce the monitoring interval of container checker
    from 5 minutes to 2 minutes, then restart the Monit service without the start
    delay. After testing, these two changes will be rolled back.

    Args:
        duthost: Host DUT.

    Return:
        None.
    """
    logger.info("Back up Monit configuration files.")
    duthost.shell("sudo cp -f /etc/monit/monitrc /tmp/")
    duthost.shell("sudo cp -f /etc/monit/conf.d/sonic-host /tmp/")

    temp_config_line = " if status != 0 for 2 times within 2 cycles then alert repeat every 1 cycles"
    logger.info("Reduce the monitoring interval of container_checker.")
    duthost.shell("sudo sed -i '$s/^./#/' /etc/monit/conf.d/sonic-host")
    duthost.shell("echo '{}' | sudo tee -a /etc/monit/conf.d/sonic-host".format(temp_config_line))
    duthost.shell("sudo sed -i '/with start delay 300/s/^./#/' /etc/monit/monitrc")
    logger.info("Restart the Monit service without delaying to monitor.")
    duthost.shell("sudo systemctl restart monit")
    yield
    logger.info("Roll back the Monit configuration of container checker.")
    duthost.shell("sudo mv -f /tmp/monitrc /etc/monit/")
    duthost.shell("sudo mv -f /tmp/sonic-host /etc/monit/conf.d/")
    logger.info("Restart the Monit service and delay monitoring for 5 minutes.")
    duthost.shell("sudo systemctl restart monit")
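
To make the two edits to `sonic-host` easier to follow, the sketch below mimics them on an in-memory string: `sed '$s/^./#/'` replaces the first character of the last line with `#`, and `tee -a` appends the temporary rule. The sample file content and the helper names are assumptions for illustration only and were not taken from a real device.

```python
def comment_out_last_line(text):
    """Mimic `sed '$s/^./#/'`: replace the first character of the last line with '#'."""
    lines = text.splitlines()
    if lines and lines[-1]:
        lines[-1] = "#" + lines[-1][1:]
    return "\n".join(lines) + "\n"


def append_line(text, line):
    """Mimic `echo LINE | tee -a FILE`: append one line at the end of the file."""
    return text + line + "\n"


# Assumed sample content of /etc/monit/conf.d/sonic-host (last entry only).
original = (
    'check program container_checker with path "/usr/bin/container_checker"\n'
    "    if status != 0 for 5 times within 5 cycles then alert repeat every 1 cycles\n"
)
temp_config_line = "    if status != 0 for 2 times within 2 cycles then alert repeat every 1 cycles"

updated = append_line(comment_out_last_line(original), temp_config_line)
print(updated)
```

Note that the `sed` expression overwrites a character rather than inserting one, so it only preserves the rule text when the targeted line begins with whitespace, as it does in the assumed sample.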


def get_disabled_container_list(duthost):
    """Gets the container/service names which are disabled.

    Args:
        duthost: Host DUT.

    Return:
        A list of the names of disabled containers/services.
    """
    disabled_containers = []

    container_status, succeeded = duthost.get_feature_status()
    pytest_assert(succeeded, "Failed to get status ('enabled'|'disabled') of containers. Exiting...")

    for container_name, status in container_status.items():
        if status == "disabled":
            disabled_containers.append(container_name)

    return disabled_containers


def check_all_critical_processes_status(duthost):
    """Post-checks the status of critical processes.

    Args:
        duthost: Host DUT.

    Return:
        This function will return True if all critical processes are running.
        Otherwise it will return False.
    """
    processes_status = duthost.all_critical_process_status()
    for container_name, processes in processes_status.items():
        if processes["status"] is False or len(processes["exited_critical_process"]) > 0:
            return False

    return True
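
The loop above implies a per-container dictionary with a `status` flag and an `exited_critical_process` list. The sketch below applies the same rule to hand-written data. The dictionary shape is inferred from the code in this diff, and the container and process names are made up.

```python
def all_critical_processes_ok(processes_status):
    """Return False as soon as one container reports a failed status or
    lists at least one exited critical process."""
    for info in processes_status.values():
        if info["status"] is False or len(info["exited_critical_process"]) > 0:
            return False
    return True


# Hypothetical status snapshots.
healthy = {"snmp": {"status": True, "exited_critical_process": []}}
degraded = {
    "snmp": {"status": True, "exited_critical_process": ["snmpd"]},
    "lldp": {"status": True, "exited_critical_process": []},
}

print(all_critical_processes_ok(healthy))
print(all_critical_processes_ok(degraded))
```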


def post_test_check(duthost, up_bgp_neighbors):
    """Post-checks the status of critical processes and state of BGP sessions.

    Args:
        duthost: Host DUT.
        up_bgp_neighbors: A list of the BGP neighbors whose sessions were
          established before the test.

    Return:
        This function will return True if all critical processes are running and
        all BGP sessions are established. Otherwise it will return False.
    """
    return (check_all_critical_processes_status(duthost)
            and duthost.check_bgp_session_state(up_bgp_neighbors, "established"))


def postcheck_critical_processes_status(duthost, up_bgp_neighbors):
    """Calls the functions to post-check the status of critical processes and
    state of BGP sessions.

    Args:
        duthost: Host DUT.
        up_bgp_neighbors: A list of the BGP neighbors whose sessions were
          established before the test.

    Return:
        True if all critical processes are running and all BGP sessions are
        established. The check is repeated every CONTAINER_CHECK_INTERVAL_SECS
        (1 second) for up to CONTAINER_RESTART_THRESHOLD_SECS (3 minutes), and
        False is returned after the timeout.
    """
    logger.info("Post-checking status of critical processes and BGP sessions...")
    return wait_until(CONTAINER_RESTART_THRESHOLD_SECS, CONTAINER_CHECK_INTERVAL_SECS,
                      post_test_check, duthost, up_bgp_neighbors)


def stop_containers(duthost, container_autorestart_states, skip_containers):
    """Stops the running containers and returns their names as a list.

    Args:
        duthost: Host DUT.
        container_autorestart_states: A dictionary whose key is the container name and
          whose value is the state of the autorestart feature.
        skip_containers: A list of the container names which should be skipped.

    Return:
        A list of the container names which were stopped.
    """
    stopped_container_list = []

    for container_name in container_autorestart_states.keys():
        if container_name not in skip_containers:
            logger.info("Stopping the container '{}'...".format(container_name))
            duthost.shell("sudo systemctl stop {}.service".format(container_name))
            logger.info("Waiting until container '{}' is stopped...".format(container_name))
            stopped = wait_until(CONTAINER_STOP_THRESHOLD_SECS,
                                 CONTAINER_CHECK_INTERVAL_SECS,
                                 check_container_state, duthost, container_name, False)
            pytest_assert(stopped, "Failed to stop container '{}'".format(container_name))
            logger.info("Container '{}' was stopped".format(container_name))
            stopped_container_list.append(container_name)

    return stopped_container_list


def get_expected_alerting_messages(stopped_container_list):
    """Generates the expected alerting messages from the stopped containers.

    Args:
        stopped_container_list: A list of container names.

    Return:
        A list of the expected alerting messages.
    """
    logger.info("Generating the expected alerting messages...")
    expected_alerting_messages = []

    for container_name in stopped_container_list:
        expected_alerting_messages.append(".*Expected containers not running.*{}.*".format(container_name))

    logger.info("Generating the expected alerting messages was done!")
    return expected_alerting_messages
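
As a quick illustration of how these patterns behave, the sketch below builds them for a few container names and tests them against one sample log line with `re.search`. The sample line is made up for this example, and using `re.search` is an assumption about how the patterns are applied, not a statement about LogAnalyzer internals.

```python
import re


def build_patterns(container_names):
    """Build one regex per container, in the same form as the function above."""
    return [".*Expected containers not running.*{}.*".format(name) for name in container_names]


# Made-up syslog line; the real Monit alert text on a DUT may differ.
line = "ERR container_checker: Expected containers not running: snmp, lldp"

patterns = build_patterns(["snmp", "lldp", "bgp"])
matched = [p for p in patterns if re.search(p, line)]
print(matched)
```

Because the container name is matched as a plain substring, a short name would also match a longer name that contains it, which is worth keeping in mind when reading a failure report.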


def check_containers_status(duthost, stopped_container_list):
    """Checks whether the stopped containers were started.

    This function will check whether the stopped containers were restarted or not.
    If a container was not restarted by the function 'config_reload(...)', then this
    function will start it and then check its status.

    Args:
        duthost: Host DUT.
        stopped_container_list: Names of stopped containers.

    Returns:
        None.
    """
    for container_name in stopped_container_list:
        logger.info("Checking the running status of container '{}'".format(container_name))
        if is_container_running(duthost, container_name):
            logger.info("Container '{}' is running.".format(container_name))
        else:
            logger.info("Container '{}' is not running; restarting it...".format(container_name))
            duthost.shell("sudo systemctl restart {}".format(container_name))
            logger.info("Waiting until container '{}' is restarted...".format(container_name))
            restarted = wait_until(CONTAINER_RESTART_THRESHOLD_SECS,
                                   CONTAINER_CHECK_INTERVAL_SECS,
                                   check_container_state, duthost, container_name, True)
            if not restarted:
                if is_hitting_start_limit(duthost, container_name):
                    clear_failed_flag_and_restart(duthost, container_name)
                else:
                    pytest.fail("Failed to restart container '{}'".format(container_name))

            logger.info("Container '{}' was restarted".format(container_name))


def test_container_checker(duthosts, rand_one_dut_hostname, tbinfo):
    """Tests the feature of container checker.

    This function will check whether the container names will appear in the Monit
    alerting message if they are stopped explicitly or they hit start limitation.

    Args:
        duthosts: list of DUTs.
        rand_one_dut_hostname: hostname of DUT.
        tbinfo: Testbed information.

    Returns:
        None.
    """
    duthost = duthosts[rand_one_dut_hostname]
    loganalyzer = LogAnalyzer(ansible_host=duthost, marker_prefix="container_checker")
    loganalyzer.expect_regex = []

    container_autorestart_states = duthost.get_container_autorestart_states()
    disabled_containers = get_disabled_container_list(duthost)

    bgp_neighbors = duthost.get_bgp_neighbors()
    up_bgp_neighbors = [k.lower() for k, v in bgp_neighbors.items() if v["state"] == "established"]

    skip_containers = disabled_containers[:]
    # Skip 'radv' container on devices whose role is not T0.
    if tbinfo["topo"]["type"] != "t0":
        skip_containers.append("radv")

    stopped_container_list = stop_containers(duthost, container_autorestart_states, skip_containers)
    pytest_assert(len(stopped_container_list) > 0, "None of containers was stopped!")

    expected_alerting_messages = get_expected_alerting_messages(stopped_container_list)
    loganalyzer.expect_regex.extend(expected_alerting_messages)
    marker = loganalyzer.init()

    # Wait a little over 2 minutes so that Monit has a chance to write the alerting message into syslog.
    logger.info("Sleep 130 seconds to wait for the alerting message...")
    time.sleep(130)

    logger.info("Checking the alerting messages from syslog...")
    loganalyzer.analyze(marker)
    logger.info("Found all the expected alerting messages from syslog!")

    logger.info("Executing the config reload...")
    config_reload(duthost)
Contributor Author

@bingwang-ms @wangxin Please help me review this PR again. The previous PR (#2852) was reverted because BGP sessions were down after all containers were restarted, which made the post-check fail on the KVM testbed. The root cause is that I should not restart the containers in the order in which they were stopped. The correct way is to use config_reload(...) to restart them, since some containers depend on others. After config_reload(...) has run, line 255 checks whether all stopped containers are running and restarts any container that config_reload(...) did not bring back, such as mgmt-framework.

Collaborator

@bingwang-ms bingwang-ms Feb 2, 2021

LGTM, except for the time.sleep(360) that guohan mentioned. Please update the Monit config for container_checker and shorten the sleep time. Thanks.

Contributor Author

Thanks, Bing! Please help me review the function update_monit_service(...) and line 274.

Collaborator

Why do we have to do config_reload here? Is config_reload_after_tests not enough?

Contributor Author

This script first calls config_reload(...) to start the stopped containers and then runs the post-check to verify that the BGP sessions are established and all critical processes are running.

Collaborator

Why not move config_reload and the checking below to config_reload_after_tests?

Contributor Author

@yozhao101 yozhao101 Mar 4, 2021

@yxieca This pytest script skips DUTs that have the 201911 image installed. If the script is run against a DUT with the public image installed, and I remove the config_reload(...) here and only invoke it in the fixture config_reload_after_tests(...), then the mgmt-framework container will not be restarted. That is why the script first invokes config_reload(...) here and then uses check_containers_status(...) to restart the containers that config_reload(...) cannot restart.
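
The recovery order described in this thread (reload the configuration first, then individually restart whatever is still down) can be sketched with plain callables. Everything here is a hypothetical stand-in: `recover_containers`, the fake reload, and the choice of which container the reload misses are illustrative and do not call any sonic-mgmt API.

```python
def recover_containers(stopped, reload_config, is_running, restart):
    """Reload the configuration once, then restart only the containers
    that are still not running. Returns the names restarted individually."""
    reload_config()
    restarted_individually = []
    for name in stopped:
        if not is_running(name):
            restart(name)
            restarted_individually.append(name)
    return restarted_individually


# Fake device state: the reload brings back everything except mgmt-framework.
stopped = ["swss", "bgp", "mgmt-framework"]
running = set()


def fake_reload():
    running.update(name for name in stopped if name != "mgmt-framework")


leftover = recover_containers(stopped, fake_reload, running.__contains__, running.add)
print(leftover)
```

Letting the reload handle start ordering avoids encoding inter-container dependencies in the test, and the follow-up loop covers the containers the reload leaves behind.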

    logger.info("Executing the config reload was done!")

    check_containers_status(duthost, stopped_container_list)

    if not postcheck_critical_processes_status(duthost, up_bgp_neighbors):
        pytest.fail("Post-check failed after testing the container checker!")
    logger.info("Post-checking status of critical processes and BGP sessions was done!")
6 changes: 4 additions & 2 deletions tests/kvmtest.sh
@@ -124,7 +124,8 @@ test_t0() {
     test_procdockerstatsd.py \
     iface_namingmode/test_iface_namingmode.py \
     platform_tests/test_cpu_memory_usage.py \
-    bgp/test_bgpmon.py"
+    bgp/test_bgpmon.py \
+    container_checker/test_container_checker.py"
 
     pushd $SONIC_MGMT_DIR/tests
     ./run_tests.sh $RUNTEST_CLI_COMMON_OPTS -c "$tests" -p logs/$tgname
@@ -154,7 +155,8 @@ test_t1_lag() {
     lldp/test_lldp.py \
     route/test_default_route.py \
     platform_tests/test_cpu_memory_usage.py \
-    bgp/test_bgpmon.py"
+    bgp/test_bgpmon.py \
+    container_checker/test_container_checker.py"
 
     pushd $SONIC_MGMT_DIR/tests
     ./run_tests.sh $RUNTEST_CLI_COMMON_OPTS -c "$tests" -p logs/$tgname