Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[teamd]: Warm/Fast reboot: Send USR1/USR2 only to subscribers #8856

Merged

Conversation

nazariig
Copy link
Collaborator

@nazariig nazariig commented Sep 28, 2021

Signed-off-by: Nazarii Hnydyn [email protected]

SONiC LAG configuration:

root@sonic:/home/admin# show interfaces portchannel
Flags: A - active, I - inactive, Up - up, Dw - Down, N/A - not available,
       S - selected, D - deselected, * - not synced
  No.  Team Dev         Protocol     Ports
-----  ---------------  -----------  --------------
 0001  PortChannel0001  LACP(A)(Dw)  Ethernet224(D)
 0002  PortChannel0002  LACP(A)(Dw)  Ethernet232(D)
 0003  PortChannel0003  LACP(A)(Dw)  Ethernet240(D)
 0004  PortChannel0004  LACP(A)(Dw)  Ethernet248(D)

Before:

Sep 27 15:36:39.593423 sonic NOTICE admin: Pre-shutdown succeeded, state: pre-shutdown-succeeded ...
Sep 27 15:36:39.818878 sonic NOTICE admin: Backing up database ...
Sep 27 15:36:39.892730 sonic NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel0003' has been removed.
Sep 27 15:36:39.892826 sonic NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel0004' has been removed.
Sep 27 15:36:39.892877 sonic NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel0002' has been removed.
Sep 27 15:36:39.892899 sonic NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel0001' has been removed.
Sep 27 15:36:39.980706 sonic INFO database#/supervisord: redis 35:M 27 Sep 2021 15:36:39.980 * DB saved on disk
Sep 27 15:36:40.104994 sonic INFO systemd[1]: var-lib-docker-overlay2-669b2b0b164bc3da2be6b7bed86437d1a47a9235297a62c71e1f159a292ba970-merged-etc-sonic.mount: Succeeded.
Sep 27 15:36:40.125423 sonic INFO systemd[1]: var-lib-docker-overlay2-669b2b0b164bc3da2be6b7bed86437d1a47a9235297a62c71e1f159a292ba970-merged-usr-share-sonic-platform.mount: Succeeded.
Sep 27 15:36:40.156169 sonic INFO systemd[1]: var-lib-docker-overlay2-669b2b0b164bc3da2be6b7bed86437d1a47a9235297a62c71e1f159a292ba970-merged-run-redis.mount: Succeeded.
Sep 27 15:36:40.182548 sonic INFO systemd[1]: var-lib-docker-overlay2-669b2b0b164bc3da2be6b7bed86437d1a47a9235297a62c71e1f159a292ba970-merged-etc-resolv.conf.mount: Succeeded.
Sep 27 15:36:40.202195 sonic INFO systemd[1]: var-lib-docker-overlay2-669b2b0b164bc3da2be6b7bed86437d1a47a9235297a62c71e1f159a292ba970-merged-etc-hostname.mount: Succeeded.
Sep 27 15:36:40.221990 sonic INFO systemd[1]: var-lib-docker-overlay2-669b2b0b164bc3da2be6b7bed86437d1a47a9235297a62c71e1f159a292ba970-merged-etc-hosts.mount: Succeeded.
Sep 27 15:36:40.397449 sonic NOTICE admin: Stopping teamd ...
Sep 27 15:36:40.406095 sonic INFO systemd[1]: Stopping TEAMD container...
Sep 27 15:36:40.413385 sonic NOTICE admin: Stopping teamd service...
Sep 27 15:36:40.615055 sonic NOTICE admin: Warm boot flag: teamd true.
Sep 27 15:36:40.621085 sonic NOTICE admin: Fast boot flag: teamd false.
Sep 27 15:36:40.755013 sonic INFO teamd#supervisord 2021-09-27 15:36:40,754 INFO exited: tlm_teamd (terminated by SIGUSR1; not expected)
Sep 27 15:36:40.979849 sonic DEBUG /container: container_stop: BEGIN
Sep 27 15:36:40.980191 sonic DEBUG /container: read_data: config:True feature:teamd fields:[('set_owner', 'local'), ('no_fallback_to_local', False)] val:['local', False]
Sep 27 15:36:40.980440 sonic DEBUG /container: read_data: config:False feature:teamd fields:[('current_owner', 'none'), ('remote_state', 'none'), ('container_id', '')] val:['none', 'none', '']
Sep 27 15:36:40.980672 sonic DEBUG /container: container_stop: teamd: set_owner:local current_owner:none remote_state:none docker_id:teamd
Sep 27 15:36:41.758506 sonic INFO teamd#supervisord 2021-09-27 15:36:41,756 INFO reaped unknown pid 25 (exit status 0)
Sep 27 15:36:41.758506 sonic INFO teamd#supervisord 2021-09-27 15:36:41,756 INFO reaped unknown pid 34 (exit status 0)
Sep 27 15:36:41.758506 sonic INFO teamd#supervisord 2021-09-27 15:36:41,757 INFO reaped unknown pid 42 (exit status 0)
Sep 27 15:36:41.758506 sonic INFO teamd#supervisord 2021-09-27 15:36:41,757 INFO reaped unknown pid 50 (exit status 0)
Sep 27 15:36:41.766193 sonic INFO teamd#/supervisor-proc-exit-listener: Process 'tlm_teamd' exited unexpectedly. Terminating supervisor 'teamd'
Sep 27 15:36:41.767071 sonic INFO teamd#supervisord 2021-09-27 15:36:41,766 WARN received SIGTERM indicating exit request
Sep 27 15:36:41.767331 sonic INFO teamd#supervisord 2021-09-27 15:36:41,766 INFO waiting for supervisor-proc-exit-listener, rsyslogd, teammgrd, teamsyncd to die
Sep 27 15:36:43.770027 sonic NOTICE teamd#teamsyncd: :- cleanTeamSync: Cleaning up LAG teamd resources ...
Sep 27 15:36:43.771247 sonic NOTICE teamd#teamsyncd: :- main: Received SIGTERM Exiting
Sep 27 15:36:44.416124 sonic INFO teamd#supervisord 2021-09-27 15:36:44,415 INFO stopped: teamsyncd (exit status 0)
Sep 27 15:36:45.417286 sonic NOTICE teamd#teammgrd: :- cleanTeamProcesses: Cleaning up LAGs during shutdown...
Sep 27 15:36:45.418185 sonic INFO teamd#supervisord 2021-09-27 15:36:45,417 INFO waiting for supervisor-proc-exit-listener, rsyslogd, teammgrd to die
Sep 27 15:36:45.421040 sonic INFO teamd#/supervisord: teammgrd cat: /var/run/teamd/PortChannel0001.pid: No such file or directory
Sep 27 15:36:45.421795 sonic ERR teamd#teammgrd: :- main: Runtime error: cat "/var/run/teamd/PortChannel0001.pid" :
Sep 27 15:36:46.271465 sonic INFO teamd#supervisord 2021-09-27 15:36:46,270 INFO stopped: teammgrd (exit status 1)
Sep 27 15:36:47.425981 sonic INFO containerd[486]: time="2021-09-27T15:36:47.424407649Z" level=info msg="shim reaped" id=90f942c9e7be6991ae0ce30bc71738de51f1f599773106caf1c70860cb87731a
Sep 27 15:36:47.436674 sonic INFO dockerd[726]: time="2021-09-27T15:36:47.434422086Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Sep 27 15:36:47.446982 sonic INFO systemd[1]: var-lib-docker-containers-90f942c9e7be6991ae0ce30bc71738de51f1f599773106caf1c70860cb87731a-mounts-shm.mount: Succeeded.
Sep 27 15:36:47.469852 sonic INFO systemd[1]: var-lib-docker-overlay2-06a63c38bf0b38196702d168ba789fed9969900a0ac534ea6836a280b3b643d5-merged.mount: Succeeded.
Sep 27 15:36:47.497159 sonic INFO /container: docker cmd: wait for teamd
Sep 27 15:36:47.498422 sonic INFO /container: docker cmd: stop for teamd
Sep 27 15:36:47.498678 sonic DEBUG /container: container_stop: END
Sep 27 15:36:47.527123 sonic NOTICE admin: Stopped teamd service...
Sep 27 15:36:47.529288 sonic INFO systemd[1]: teamd.service: Succeeded.
Sep 27 15:36:47.529910 sonic INFO systemd[1]: Stopped TEAMD container.
Sep 27 15:36:47.535760 sonic NOTICE admin: Stopped  teamd ...

After:

Sep 27 15:43:53.121517 sonic NOTICE admin: Pre-shutdown succeeded, state: pre-shutdown-succeeded ...
Sep 27 15:43:53.355894 sonic NOTICE admin: Backing up database ...
Sep 27 15:43:53.429249 sonic NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel0003' has been removed.
Sep 27 15:43:53.429249 sonic NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel0002' has been removed.
Sep 27 15:43:53.429249 sonic NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel0004' has been removed.
Sep 27 15:43:53.429368 sonic NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel0001' has been removed.
Sep 27 15:43:53.523066 sonic INFO database#/supervisord: redis 36:M 27 Sep 2021 15:43:53.522 * DB saved on disk
Sep 27 15:43:53.647846 sonic INFO systemd[1]: var-lib-docker-overlay2-669b2b0b164bc3da2be6b7bed86437d1a47a9235297a62c71e1f159a292ba970-merged-etc-sonic.mount: Succeeded.
Sep 27 15:43:53.673307 sonic INFO systemd[1]: var-lib-docker-overlay2-669b2b0b164bc3da2be6b7bed86437d1a47a9235297a62c71e1f159a292ba970-merged-usr-share-sonic-platform.mount: Succeeded.
Sep 27 15:43:53.706143 sonic INFO systemd[1]: var-lib-docker-overlay2-669b2b0b164bc3da2be6b7bed86437d1a47a9235297a62c71e1f159a292ba970-merged-run-redis.mount: Succeeded.
Sep 27 15:43:53.753142 sonic INFO systemd[1]: var-lib-docker-overlay2-669b2b0b164bc3da2be6b7bed86437d1a47a9235297a62c71e1f159a292ba970-merged-etc-resolv.conf.mount: Succeeded.
Sep 27 15:43:53.776188 sonic INFO systemd[1]: var-lib-docker-overlay2-669b2b0b164bc3da2be6b7bed86437d1a47a9235297a62c71e1f159a292ba970-merged-etc-hostname.mount: Succeeded.
Sep 27 15:43:53.799224 sonic INFO systemd[1]: var-lib-docker-overlay2-669b2b0b164bc3da2be6b7bed86437d1a47a9235297a62c71e1f159a292ba970-merged-etc-hosts.mount: Succeeded.
Sep 27 15:43:53.956416 sonic NOTICE admin: Stopping teamd ...
Sep 27 15:43:53.965053 sonic INFO systemd[1]: Stopping TEAMD container...
Sep 27 15:43:53.973066 sonic NOTICE admin: Stopping teamd service...
Sep 27 15:43:54.179522 sonic NOTICE admin: Warm boot flag: teamd true.
Sep 27 15:43:54.186392 sonic NOTICE admin: Fast boot flag: teamd false.
Sep 27 15:43:54.540820 sonic DEBUG /container: container_stop: BEGIN
Sep 27 15:43:54.541168 sonic DEBUG /container: read_data: config:True feature:teamd fields:[('set_owner', 'local'), ('no_fallback_to_local', False)] val:['local', False]
Sep 27 15:43:54.541496 sonic DEBUG /container: read_data: config:False feature:teamd fields:[('current_owner', 'none'), ('remote_state', 'none'), ('container_id', '')] val:['none', 'none', '']
Sep 27 15:43:54.541765 sonic DEBUG /container: container_stop: teamd: set_owner:local current_owner:none remote_state:none docker_id:teamd
Sep 27 15:43:55.103875 sonic INFO teamd#supervisord 2021-09-27 15:43:55,101 INFO reaped unknown pid 26 (exit status 0)
Sep 27 15:43:55.103875 sonic INFO teamd#supervisord 2021-09-27 15:43:55,102 INFO reaped unknown pid 34 (exit status 0)
Sep 27 15:43:55.103875 sonic INFO teamd#supervisord 2021-09-27 15:43:55,102 INFO reaped unknown pid 42 (exit status 0)
Sep 27 15:43:55.103875 sonic INFO teamd#supervisord 2021-09-27 15:43:55,102 INFO reaped unknown pid 50 (exit status 0)
Sep 27 15:43:56.104759 sonic INFO teamd#supervisord 2021-09-27 15:43:56,104 WARN received SIGTERM indicating exit request
Sep 27 15:43:56.105595 sonic INFO teamd#supervisord 2021-09-27 15:43:56,104 INFO waiting for supervisor-proc-exit-listener, rsyslogd, teammgrd, teamsyncd, tlm_teamd to die
Sep 27 15:43:57.105912 sonic NOTICE teamd#tlm_teamd: :- main: Exiting
Sep 27 15:43:57.817711 sonic INFO teamd#supervisord 2021-09-27 15:43:57,817 INFO stopped: tlm_teamd (exit status 0)
Sep 27 15:43:58.818834 sonic NOTICE teamd#teamsyncd: :- cleanTeamSync: Cleaning up LAG teamd resources ...
Sep 27 15:43:58.820140 sonic NOTICE teamd#teamsyncd: :- main: Received SIGTERM Exiting
Sep 27 15:43:58.948445 sonic INFO teamd#supervisord 2021-09-27 15:43:58,947 INFO stopped: teamsyncd (exit status 0)
Sep 27 15:43:59.949582 sonic NOTICE teamd#teammgrd: :- cleanTeamProcesses: Cleaning up LAGs during shutdown...
Sep 27 15:43:59.950284 sonic INFO teamd#supervisord 2021-09-27 15:43:59,949 INFO waiting for supervisor-proc-exit-listener, rsyslogd, teammgrd to die
Sep 27 15:43:59.953287 sonic INFO teamd#/supervisord: teammgrd cat: /var/run/teamd/PortChannel0001.pid: No such file or directory
Sep 27 15:43:59.954101 sonic ERR teamd#teammgrd: :- main: Runtime error: cat "/var/run/teamd/PortChannel0001.pid" :
Sep 27 15:44:00.790477 sonic INFO teamd#supervisord 2021-09-27 15:44:00,789 INFO stopped: teammgrd (exit status 1)
Sep 27 15:44:01.939112 sonic INFO containerd[486]: time="2021-09-27T15:44:01.936948014Z" level=info msg="shim reaped" id=90f942c9e7be6991ae0ce30bc71738de51f1f599773106caf1c70860cb87731a
Sep 27 15:44:01.949213 sonic INFO dockerd[675]: time="2021-09-27T15:44:01.947031450Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Sep 27 15:44:01.958717 sonic INFO systemd[1]: var-lib-docker-containers-90f942c9e7be6991ae0ce30bc71738de51f1f599773106caf1c70860cb87731a-mounts-shm.mount: Succeeded.
Sep 27 15:44:01.982581 sonic INFO systemd[1]: var-lib-docker-overlay2-06a63c38bf0b38196702d168ba789fed9969900a0ac534ea6836a280b3b643d5-merged.mount: Succeeded.
Sep 27 15:44:02.026851 sonic INFO /container: docker cmd: wait for teamd
Sep 27 15:44:02.028477 sonic INFO /container: docker cmd: stop for teamd
Sep 27 15:44:02.028767 sonic DEBUG /container: container_stop: END
Sep 27 15:44:02.057161 sonic NOTICE admin: Stopped teamd service...
Sep 27 15:44:02.060093 sonic INFO systemd[1]: teamd.service: Succeeded.
Sep 27 15:44:02.061628 sonic INFO systemd[1]: Stopped TEAMD container.

Why I did it

  • To fix teamd signal handling:
Sep 27 15:36:40.755013 sonic INFO teamd#supervisord 2021-09-27 15:36:40,754 INFO exited: tlm_teamd (terminated by SIGUSR1; not expected)
Sep 27 15:36:41.766193 sonic INFO teamd#/supervisor-proc-exit-listener: Process 'tlm_teamd' exited unexpectedly. Terminating supervisor 'teamd'

How I did it

  • N/A

How to verify it

  • N/A

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106

Description for the changelog

  • N/A

A picture of a cute animal (not mandatory but encouraged)

      .---.        .-----------
     /     \  __  /    ------
    / /     \(  )/    -----
   //////   ' \/ `   ---
  //// / // :    : ---
 // /   /  /`    '--
//          //..\\
       ====UU====UU====
           '//||\\`
             ''``

@nazariig
Copy link
Collaborator Author

/azpw run

@mssonicbld
Copy link
Collaborator

/AzurePipelines run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@nazariig
Copy link
Collaborator Author

==================================== ERRORS ====================================
____________________ ERROR at teardown of test_warm_reboot _____________________

fixture_generator = <function verify_dut_health at 0x7f45e3a10150>
args = (<SubRequest 'verify_dut_health' for <Function test_warm_reboot>>, <tests.common.devices.duthosts.DutHosts object at 0..., {'auto_recover': 'False', 'comment': 'Tests virtual switch vm', 'conf-name': 'vms-kvm-t0', 'duts': ['vlab-01'], ...})
kargs = {}, it = <generator object verify_dut_health at 0x7f45e1b84dc0>
res = None

    @decorator.decorator
    def _fixture_generator_decorator(fixture_generator, *args, **kargs):
        # setup, to the first yield in fixture
        logging.info("-" * 20 + (" fixture %s setup starts " % fixture_generator.__name__) + "-" * 20)
        it = fixture_generator(*args, **kargs)
        try:
            res = next(it)
            logging.info("-" * 20 + (" fixture %s setup ends " % fixture_generator.__name__) + "-" * 20)
            yield res
        except Exception as detail:
            logging.exception("\n%r", detail)
            logging.info("-" * 20 + (" fixture %s setup ends " % fixture_generator.__name__) + "-" * 20)
            raise
    
        # teardown, fixture will raise StopIteration
        logging.info("-" * 20 + (" fixture %s teardown starts " % fixture_generator.__name__) + "-" * 20)
        try:
>           next(it)

args       = (<SubRequest 'verify_dut_health' for <Function test_warm_reboot>>, <tests.common.devices.duthosts.DutHosts object at 0..., {'auto_recover': 'False', 'comment': 'Tests virtual switch vm', 'conf-name': 'vms-kvm-t0', 'duts': ['vlab-01'], ...})
fixture_generator = <function verify_dut_health at 0x7f45e3a10150>

it         = <generator object verify_dut_health at 0x7f45e1b84dc0>
kargs      = {}
res        = None

common/plugins/log_section_start/__init__.py:84: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

request = <SubRequest 'verify_dut_health' for <Function test_warm_reboot>>
duthosts = <tests.common.devices.duthosts.DutHosts object at 0x7f45e393b710>
rand_one_dut_hostname = 'vlab-01'
tbinfo = {'auto_recover': 'False', 'comment': 'Tests virtual switch vm', 'conf-name': 'vms-kvm-t0', 'duts': ['vlab-01'], ...}

    @pytest.fixture
    def verify_dut_health(request, duthosts, rand_one_dut_hostname, tbinfo):
        global test_report
        test_report = {}
        duthost = duthosts[rand_one_dut_hostname]
        check_services(duthost)
        check_interfaces_and_transceivers(duthost, request)
        check_neighbors(duthost, tbinfo)
        if "20191130" in duthost.os_version:
            pre_existing_cores = duthost.shell('ls /var/core/ | grep -v python | wc -l')['stdout']
        else:
            pre_existing_cores = duthost.shell('ls /var/core/ | wc -l')['stdout']
        pytest_assert(all(list(test_report.values())), "DUT not ready for test. Health check failed before reboot: {}"
            .format(test_report))
    
        yield
    
        test_report = {}
        check_services(duthost)
        check_interfaces_and_transceivers(duthost, request)
        check_neighbors(duthost, tbinfo)
        verify_no_coredumps(duthost, pre_existing_cores)
        check_all = all([check == True for check in list(test_report.values())])
        pytest_assert(check_all, "Health check failed after reboot: {}"

>           .format(test_report))
E       Failed: Health check failed after reboot: {'check_interfaces_and_transceivers': True, 'check_services': True, 'verify_no_coredumps': 'Core dumps found. Expected: 0 Found: 1', 'check_neighbors': True}

check      = True
check_all  = False
duthost    = <MultiAsicSonicHost> vlab-01
duthosts   = <tests.common.devices.duthosts.DutHosts object at 0x7f45e393b710>
pre_existing_cores = u'0'
rand_one_dut_hostname = 'vlab-01'
request    = <SubRequest 'verify_dut_health' for <Function test_warm_reboot>>
tbinfo     = {'auto_recover': 'False', 'comment': 'Tests virtual switch vm', 'conf-name': 'vms-kvm-t0', 'duts': ['vlab-01'], ...}

platform_tests/verify_dut_health.py:157: Failed
- generated xml file: /data/sonic-mgmt/tests/logs/1vlan/platform_tests/test_advanced_reboot.xml -

@liat-grozovik
Copy link
Collaborator

/azp run

@liat-grozovik
Copy link
Collaborator

@nazariig you marked it needed for 202012, not needed for 202106?

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@nazariig
Copy link
Collaborator Author

nazariig commented Oct 5, 2021

@liat-grozovik FYI. This improvement requires more investigation since produces core dumps after warm-reboot on KVM

@nazariig nazariig force-pushed the master-lag-warm-fast-notification branch from 6a95283 to 56d0497 Compare October 5, 2021 17:43
@nazariig nazariig force-pushed the master-lag-warm-fast-notification branch from 56d0497 to 2f26999 Compare October 19, 2021 10:19
@judyjoseph
Copy link
Contributor

@liat-grozovik FYI. This improvement requires more investigation since produces core dumps after warm-reboot on KVM

@nazariig Changes are ok - do you still see this issue ? Plan to merge this in else.

@nazariig
Copy link
Collaborator Author

@liat-grozovik FYI. This improvement requires more investigation since produces core dumps after warm-reboot on KVM

@nazariig Changes are ok - do you still see this issue ? Plan to merge this in else.

@judyjoseph no, i don't see it anymore. The CI looks good on the latest master.

@judyjoseph judyjoseph merged commit 453346f into sonic-net:master Oct 26, 2021
qiluo-msft pushed a commit that referenced this pull request Oct 27, 2021
To fix teamd signal handling, without which Process 'tlm_teamd' exited unexpectedly
judyjoseph pushed a commit that referenced this pull request Nov 6, 2021
To fix teamd signal handling, without which Process 'tlm_teamd' exited unexpectedly
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants