[config] Eliminate race condition between reloading Monit config and monitoring container_checker #1543

yozhao101 · 2021-04-02T21:16:38Z

Signed-off-by: Yong Zhao [email protected]

What I did

A nightly test failed when we ran the command sudo config reload/load_minigraph. The error message is:

admin@str-a7050-acs-1:~$ sudo config reload -y
Disabling container monitoring ...
Stopping SONiC target ...
Running command: /usr/local/bin/sonic-cfggen -j /etc/sonic/init_cfg.json -j /etc/sonic/config_db.json --write-to-db
Running command: /usr/local/bin/db_migrator.py -o migrate
Resetting failed status on bgp.service
Resetting failed status on caclmgrd.service
Resetting failed status on dhcp_relay.service
Resetting failed status on hostcfgd.service
Resetting failed status on hostname-config.service
Resetting failed status on interfaces-config.service
Resetting failed status on lldp.service
Resetting failed status on ntp-config.service
Resetting failed status on pmon.service
Resetting failed status on procdockerstatsd.service
Resetting failed status on radv.service
Resetting failed status on rsyslog-config.service
Resetting failed status on swss.service
Resetting failed status on syncd.service
Resetting failed status on teamd.service
Resetting failed status on telemetry.timer
Restarting SONiC target ...
Reloading Monit configuration ...
Reinitializing monit daemon
Enabling container monitoring ...
Unix socket /var/run/monit.sock connection error -- No such file or directory

The root cause is an implicit race condition between the command sudo monit reload at line 701 and
the command sudo monit monitor container_checker at line 706. Both commands need to access the monit.sock socket file
under the directory /var/run/.

Specifically, sudo monit reload at line 701 re-initializes the Monit daemon, which deletes the old monit.sock file and then creates a new one. During this re-initialization, the command sudo monit status at line 704 always succeeds because it runs before the old monit.sock file is deleted, but the command sudo monit monitor container_checker at line 706 succeeds only if the new monit.sock has already been created; otherwise it fails with the error message shown above.
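To make the window concrete: any client that runs after the old socket file is deleted and before the new one is created has nothing to connect to. One alternative way to close such a window, which is not the approach taken in this PR, is to poll for the socket file before issuing the next command. The helper below is a minimal sketch of that idea; the name wait_for_path and its timeout values are hypothetical and do not come from sonic-utilities.

```python
import os
import time


def wait_for_path(path, timeout=5.0, interval=0.1):
    """Poll until `path` exists or `timeout` seconds elapse.

    Hypothetical helper for illustration only. Returns True if the path
    appeared in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A caller could check wait_for_path("/var/run/monit.sock") before running a Monit client command. Polling only narrows the window, though, since the file can exist before the daemon is ready to accept connections, which is one reason reordering the commands is the simpler fix.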

How I did it

I swapped the order of the two operations: monitoring of container_checker is now enabled before the Monit configuration is reloaded, so the monitor command uses the existing monit.sock instead of racing with its re-creation.
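The reordering can be sketched as follows. This is an illustration under stated assumptions, not the actual code in the config utility: the function name finish_reload and the injected run and echo callables are hypothetical, while the two command strings and the two progress messages are taken from this description and the console output.

```python
def finish_reload(run, echo=print):
    """Issue the two Monit steps in the order used after this change.

    `run` is any callable that executes a command string; `echo` prints
    progress messages. Both are hypothetical stand-ins for illustration.
    """
    # Enable monitoring first, while the existing monit.sock is still there.
    echo("Enabling container monitoring ...")
    run("sudo monit monitor container_checker")

    # Reload last, so no later Monit command depends on the new socket.
    echo("Reloading Monit configuration ...")
    run("sudo monit reload")
```

Passing a recording callable such as a list's append method as run makes the order easy to check without touching a real Monit daemon.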

How to verify it

I verified this change on the DUT str-a7050-acs-1 by running the command sudo config reload/load_minigraph -y and confirming that the error was no longer raised.
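A check like this can be automated by scanning the captured command output for the socket error and for the expected message order. The snippet below is a minimal sketch; the function names are hypothetical, and the strings it looks for are copied from the console output in this description.

```python
SOCKET_ERROR = "Unix socket /var/run/monit.sock connection error"
ENABLE_MSG = "Enabling container monitoring ..."
RELOAD_MSG = "Reloading Monit configuration ..."


def has_monit_socket_error(output):
    """Return True if any line of `output` contains the Monit socket error."""
    return any(SOCKET_ERROR in line for line in output.splitlines())


def enable_precedes_reload(output):
    """Return True if the enable message appears before the reload message."""
    lines = [line.strip() for line in output.splitlines()]
    if ENABLE_MSG not in lines or RELOAD_MSG not in lines:
        return False
    return lines.index(ENABLE_MSG) < lines.index(RELOAD_MSG)
```

Because the failure is timing-dependent, a single clean run is weak evidence; repeating the command several times and applying both checks to each run gives more confidence.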

Previous command output (if the output of a command-line utility has changed)

admin@str-a7050-acs-1:~$ sudo config reload -y
Disabling container monitoring ...
Stopping SONiC target ...
Running command: /usr/local/bin/sonic-cfggen -j /etc/sonic/init_cfg.json -j /etc/sonic/config_db.json --write-to-db
Running command: /usr/local/bin/db_migrator.py -o migrate
Resetting failed status on bgp.service
Resetting failed status on caclmgrd.service
Resetting failed status on dhcp_relay.service
Resetting failed status on hostcfgd.service
Resetting failed status on hostname-config.service
Resetting failed status on interfaces-config.service
Resetting failed status on lldp.service
Resetting failed status on ntp-config.service
Resetting failed status on pmon.service
Resetting failed status on procdockerstatsd.service
Resetting failed status on radv.service
Resetting failed status on rsyslog-config.service
Resetting failed status on swss.service
Resetting failed status on syncd.service
Resetting failed status on teamd.service
Resetting failed status on telemetry.timer
Restarting SONiC target ...
Reloading Monit configuration ...
Reinitializing monit daemon
Enabling container monitoring ...

New command output (if the output of a command-line utility has changed)

admin@str-a7050-acs-1:~$ sudo config reload -y
Disabling container monitoring ...
Stopping SONiC target ...
Running command: /usr/local/bin/sonic-cfggen -j /etc/sonic/init_cfg.json -j /etc/sonic/config_db.json --write-to-db
Running command: /usr/local/bin/db_migrator.py -o migrate
Resetting failed status on bgp.service
Resetting failed status on caclmgrd.service
Resetting failed status on dhcp_relay.service
Resetting failed status on hostcfgd.service
Resetting failed status on hostname-config.service
Resetting failed status on interfaces-config.service
Resetting failed status on lldp.service
Resetting failed status on ntp-config.service
Resetting failed status on pmon.service
Resetting failed status on procdockerstatsd.service
Resetting failed status on radv.service
Resetting failed status on rsyslog-config.service
Resetting failed status on swss.service
Resetting failed status on syncd.service
Resetting failed status on teamd.service
Resetting failed status on telemetry.timer
Restarting SONiC target ...
Enabling container monitoring ...
Reloading Monit configuration ...
Reinitializing monit daemon


jleveque · 2021-04-02T22:19:13Z

@yozhao101: Please add the "Request for " label if requesting a cherry-pick. The "Included in " label is used to indicate that a cherry-pick has been done, or that a PR is targeted at a release branch.

lguohan · 2021-04-04T03:21:42Z

/azp run

azure-pipelines · 2021-04-04T03:21:52Z

Azure Pipelines successfully started running 1 pipeline(s).

yozhao101 · 2021-04-04T21:36:06Z

@yozhao101: Please add the "Request for " label if requesting a cherry-pick. The "Included in " label is used to indicate that a cherry-pick has been done, or that a PR is targeted at a release branch.

Thanks, Joe! I will pay attention to this in the future.


Commit a6429d1: [config] Eliminate race condition between reloading Monit config and enabling container_checker. Signed-off-by: Yong Zhao <[email protected]>

yozhao101 changed the title ~~[config] Eliminate race condition between reloading Monit config and enabling container_checker~~ [config] Eliminate race condition between reloading Monit config and monitoring container_checker Apr 2, 2021

yozhao101 added Bug Included in 202012 Branch labels Apr 2, 2021

yozhao101 requested a review from jleveque April 2, 2021 21:56

jleveque added Request for 202012 Branch and removed Included in 202012 Branch labels Apr 2, 2021

jleveque approved these changes Apr 2, 2021


yozhao101 merged commit 9bbc25f into sonic-net:master Apr 4, 2021

yozhao101 deleted the eliminate_race_condition branch April 4, 2021 21:39

yozhao101 mentioned this pull request Apr 4, 2021

[sonic-utilities] Update submodule. sonic-net/sonic-buildimage#7227

Merged

yxieca added the Included in 202012 Branch label Apr 8, 2021

yozhao101 mentioned this pull request Apr 15, 2021

Unix socket /var/run/monit.sock connection error -- No such file or directory sonic-net/sonic-buildimage#7024

Open

yozhao101 mentioned this pull request Aug 22, 2021

[sonic-utilities] Update the submodule of sonic-utilities. sonic-net/sonic-buildimage#8551

Closed
