
[monit] monit reported 'process is not running' even though the process was started #5751

Closed
bingwang-ms opened this issue Oct 30, 2020 · 1 comment

@bingwang-ms (Contributor):

Description
Two issues were found with the monit service.

  1. Monit is configured to report an error only after 5 consecutive failures, but it was reporting an error in every cycle.
```
check program telemetry|telemetry with path "/usr/bin/process_checker telemetry /usr/sbin/telemetry"
    if status != 0 for 5 times within 5 cycles then alert
```

Errors were reported in every cycle:

```
Oct 30 07:09:13.108807 str-7260cx3-acs-2 INFO telemetry#supervisord 2020-10-30 07:09:03,446 INFO waiting for telemetry to stop
Oct 30 07:09:13.108807 str-7260cx3-acs-2 INFO telemetry#supervisord 2020-10-30 07:09:03,450 INFO stopped: telemetry (terminated by SIGTERM)
Oct 30 07:10:04.936231 str-7260cx3-acs-2 ERR monit[634]: 'telemetry|telemetry' status failed (1) -- '/usr/sbin/telemetry' is not running.
Oct 30 07:11:05.510358 str-7260cx3-acs-2 ERR monit[634]: 'telemetry|telemetry' status failed (1) -- '/usr/sbin/telemetry' is not running.
Oct 30 07:12:06.113958 str-7260cx3-acs-2 ERR monit[634]: 'telemetry|telemetry' status failed (1) -- '/usr/sbin/telemetry' is not running.
......
```
  2. Monit was raising false alarms. Sometimes the monitored process was running, but monit still reported an ERROR message in syslog. As a result, LogAnalyzer flags test errors for some test cases, making those tests unreliable.
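For context, the `check program` line above runs `/usr/bin/process_checker telemetry /usr/sbin/telemetry` and alerts on a non-zero exit status. A minimal sketch of what such a checker might do is below; this is an assumption for illustration only, and the actual SONiC `process_checker` implementation may match processes differently.

```python
import os

def is_process_running(needle: str) -> bool:
    """Return True if any running process's command line contains `needle`.

    Hypothetical sketch: scans /proc the way a simple process checker
    might; the real /usr/bin/process_checker in SONiC may differ.
    """
    for pid in os.listdir("/proc"):
        if not pid.isdigit():
            continue
        try:
            with open(f"/proc/{pid}/cmdline", "rb") as f:
                cmdline = f.read().replace(b"\0", b" ").decode(errors="replace")
        except OSError:
            continue  # process exited while we were scanning
        if needle in cmdline:
            return True
    return False

# A monit wrapper would translate the boolean into an exit status, e.g.
# sys.exit(0 if is_process_running("/usr/sbin/telemetry") else 1),
# so "if status != 0" in the monit config means "process not found".
```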

Steps to reproduce the issue:
The following steps reproduce issue 2.
1. Stop telemetry in the telemetry container with `supervisorctl stop telemetry`. Monit then detects the failure and reports an ERROR message in syslog.

```
Oct 30 06:38:46.577725 str-7260cx3-acs-2 ERR monit[634]: 'telemetry|telemetry' status failed (1) -- '/usr/sbin/telemetry' is not running.
```

2. Restart telemetry in the telemetry container. Monit should not report an ERROR in the next cycle, but it did. I confirmed that telemetry was running at that time. It appears that monit picked up the last failed state and reported it as an error.

```
Oct 30 06:39:01.539567 str-7260cx3-acs-2 NOTICE acms#root: Waiting for bootstrap cert
Oct 30 06:39:11.752802 str-7260cx3-acs-2 INFO telemetry#supervisord 2020-10-30 06:39:05,322 INFO spawned: 'telemetry' with pid 79
Oct 30 06:39:11.752802 str-7260cx3-acs-2 INFO telemetry#supervisord 2020-10-30 06:39:07,176 INFO success: telemetry entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
Oct 30 06:39:47.122360 str-7260cx3-acs-2 ERR monit[634]: 'telemetry|telemetry' status failed (1) -- '/usr/sbin/telemetry' is not running.
```

Describe the results you received:
Monit did not work as configured: it alerted every cycle instead of after 5 consecutive failures, and it raised false alarms after the process had been restarted.

Describe the results you expected:
Monit works as configured, with no false alarms.

Additional information you deem important (e.g. issue happens only occasionally):

**Output of `show version`:**

The issues occur on both the 201911 and master branches.

```
SONiC Software Version: SONiC.20191130.51
Distribution: Debian 9.13
Kernel: 4.9.0-11-2-amd64
Build commit: ff6ec30ff
Build date: Fri Oct 16 11:51:41 UTC 2020
Built by: sonicbld@jenkins-slave-phx-2

Platform: x86_64-arista_7260cx3_64
HwSKU: Arista-7260CX3-D108C8
ASIC: broadcom
Serial Number: SSJ17432414
Uptime: 07:15:39 up 46 min,  2 users,  load average: 4.40, 3.43, 2.98

Docker images:
REPOSITORY                 TAG                 IMAGE ID            SIZE
docker-snmp-sv2            20191130.51         e520a5c3e971        348MB
docker-snmp-sv2            latest              e520a5c3e971        348MB
docker-fpm-frr             20191130.51         cb54d4a581a8        335MB
docker-fpm-frr             latest              cb54d4a581a8        335MB
docker-lldp-sv2            20191130.51         a1b7b80da75b        312MB
docker-lldp-sv2            latest              a1b7b80da75b        312MB
docker-acms                20191130.51         9913b8c8dc93        182MB
docker-acms                latest              9913b8c8dc93        182MB
docker-orchagent           20191130.51         7658cbe496e2        333MB
docker-orchagent           latest              7658cbe496e2        333MB
docker-teamd               20191130.51         8ad424580cd1        314MB
docker-teamd               latest              8ad424580cd1        314MB
docker-syncd-brcm          20191130.51         de3cc54805fd        436MB
docker-syncd-brcm          latest              de3cc54805fd        436MB
docker-platform-monitor    20191130.51         5083fd751de1        357MB
docker-platform-monitor    latest              5083fd751de1        357MB
docker-sonic-telemetry     20191130.51         d3fb957dd40c        353MB
docker-sonic-telemetry     latest              d3fb957dd40c        353MB
docker-database            20191130.51         255dbf2854be        289MB
docker-database            latest              255dbf2854be        289MB
docker-dhcp-relay          20191130.51         55fa577c9163        299MB
docker-dhcp-relay          latest              55fa577c9163        299MB
docker-router-advertiser   20191130.51         1fc4e54211dd        289MB
docker-router-advertiser   latest              1fc4e54211dd        289MB
k8s.gcr.io/pause           3.2                 80d28bedfe5d        683kB
```
@abdosi (Contributor) commented Nov 3, 2020:

Issue 1: fixed by PR #5720.

Issue 2: a known Monit limitation, tracked upstream at https://bitbucket.org/tildeslash/monit/issues/19/race-condition-when-using-check-program

The problem is that the "check program" is always one cycle behind reality. This is a design limitation of the current Monit test scheduler: to avoid blocking the validation engine while the check program runs (its runtime can vary), Monit executes the program in one cycle, lets it finish in the background, and collects and evaluates the exit status in the next cycle. If the status indicates failure, the action is taken, and at the end of that cycle the check program is started again so its exit status can be collected in the following cycle.
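The one-cycle lag described above can be illustrated with a small simulation. Note this is an assumption-laden toy model of the scheduler behavior, not Monit's actual code: the checker's exit status is collected one cycle after it ran, so a process restarted between cycles is still reported as failed once.

```python
def monit_cycles(process_state_per_cycle):
    """Toy model of Monit's "check program" scheduling.

    process_state_per_cycle[i] is True if the process is up during cycle i.
    Returns the status Monit evaluates in each cycle: 0 = ok, 1 = failed,
    None before the first result is available.
    """
    evaluated = []
    pending = None  # exit status of the checker launched in the previous cycle
    for up in process_state_per_cycle:
        evaluated.append(pending)   # collect last cycle's result and evaluate it
        pending = 0 if up else 1    # launch the checker against current reality
    return evaluated

# Process down in cycle 0, restarted before cycle 1:
print(monit_cycles([False, True, True]))
# -> [None, 1, 0]: cycle 1 still evaluates a failure even though the
# process is already up, matching the stale ERROR seen in the repro above.
```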

abdosi closed this as completed Nov 3, 2020.