Description
Two issues were found with the monit service.
1. Monit is configured to alert only after 5 consecutive failures, but it reported an error in every cycle.
check program telemetry|telemetry with path "/usr/bin/process_checker telemetry /usr/sbin/telemetry"
if status != 0 for 5 times within 5 cycles then alert
Errors reported in each cycle:
Oct 30 07:09:13.108807 str-7260cx3-acs-2 INFO telemetry#supervisord 2020-10-30 07:09:03,446 INFO waiting for telemetry to stop
Oct 30 07:09:13.108807 str-7260cx3-acs-2 INFO telemetry#supervisord 2020-10-30 07:09:03,450 INFO stopped: telemetry (terminated by SIGTERM)
Oct 30 07:10:04.936231 str-7260cx3-acs-2 ERR monit[634]: 'telemetry|telemetry' status failed (1) -- '/usr/sbin/telemetry' is not running.
Oct 30 07:11:05.510358 str-7260cx3-acs-2 ERR monit[634]: 'telemetry|telemetry' status failed (1) -- '/usr/sbin/telemetry' is not running.
Oct 30 07:12:06.113958 str-7260cx3-acs-2 ERR monit[634]: 'telemetry|telemetry' status failed (1) -- '/usr/sbin/telemetry' is not running.
......
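To make the expectation behind the rule concrete, here is a minimal sketch of a sliding-window reading of `for 5 times within 5 cycles`. It models only how the reporter expects the threshold to behave; it is not Monit's implementation, and the function name `alert_cycles` and its parameters are hypothetical.

```python
from collections import deque


def alert_cycles(statuses, times=5, cycles=5):
    """Return the cycle indices in which an alert would fire.

    Assumption: the rule fires when at least `times` of the last
    `cycles` collected exit statuses were non-zero.
    """
    window = deque(maxlen=cycles)  # keeps only the most recent `cycles` results
    fired = []
    for index, status in enumerate(statuses):
        window.append(status != 0)
        if sum(window) >= times:
            fired.append(index)
    return fired


# Six failing cycles in a row: under this reading the first alert
# appears in the fifth cycle (index 4), not in the first one.
print(alert_cycles([1, 1, 1, 1, 1, 1]))
# A single successful cycle in between keeps the alert from firing.
print(alert_cycles([1, 1, 1, 0, 1, 1]))
```

Under this reading, the log above would be expected to stay quiet for the first four failing cycles.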
2. Monit reported false alarms. Sometimes the monitored process is running, but monit still logs an ERROR message in syslog. As a result, LogAnalyzer reports errors for some test cases, which may make the test results unreliable.
Steps to reproduce the issue:
The following steps reproduce issue 2.
1. Stop telemetry in the telemetry container with `supervisorctl stop telemetry`. Monit then detects the failure and reports an ERROR message in syslog.
Oct 30 06:38:46.577725 str-7260cx3-acs-2 ERR monit[634]: 'telemetry|telemetry' status failed (1) -- '/usr/sbin/telemetry' is not running.
2. Restart telemetry in the telemetry container. Monit shouldn't report an ERROR in the next cycle, but it did. I confirmed that telemetry was running at that time. It seems that monit picked up the last failed state and reported it as an error.
Oct 30 06:39:01.539567 str-7260cx3-acs-2 NOTICE acms#root: Waiting for bootstrap cert
Oct 30 06:39:11.752802 str-7260cx3-acs-2 INFO telemetry#supervisord 2020-10-30 06:39:05,322 INFO spawned: 'telemetry' with pid 79
Oct 30 06:39:11.752802 str-7260cx3-acs-2 INFO telemetry#supervisord 2020-10-30 06:39:07,176 INFO success: telemetry entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
Oct 30 06:39:47.122360 str-7260cx3-acs-2 ERR monit[634]: 'telemetry|telemetry' status failed (1) -- '/usr/sbin/telemetry' is not running.
Describe the results you received:
Monit didn't work as configured, and some false alarms were found.
Describe the results you expected:
Monit works as configured.
Additional information you deem important (e.g. issue happens only occasionally):
**Output of `show version`:**
The issues are found in both 201911 and master branch.
The problem is that "check program" is always one cycle behind reality. This is due to a design limitation of the current Monit test scheduler: to avoid blocking the validation engine with check program execution (whose runtime can vary), we execute the program in one cycle, let it finish in the background, and collect and evaluate the exit status in the next cycle. If the status indicates a failure, the action is performed and, at the end of the cycle, the check program is started again so that its exit status can be collected in the next cycle.
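The one-cycle lag described above can be illustrated with a small simulation. This is a sketch of the described scheduling behavior only, not Monit code; the function name `simulate` and the list-based model of cycles are assumptions made for illustration.

```python
def simulate(actual_status_per_cycle):
    """Return the exit status evaluated in each cycle.

    actual_status_per_cycle[i] is the exit status the check program
    produces when it is started in cycle i (0 means the process runs).
    The result collected in cycle i comes from the run started in
    cycle i - 1, so the first cycle has nothing to evaluate (None).
    """
    evaluated = []
    pending = None  # exit status of the run started in the previous cycle
    for actual in actual_status_per_cycle:
        evaluated.append(pending)  # collect and evaluate the previous run
        pending = actual           # start the program again for the next cycle
    return evaluated


# telemetry is stopped during cycles 1 and 2 and running again from cycle 3
actual = [0, 1, 1, 0, 0]
for cycle, (real, seen) in enumerate(zip(actual, simulate(actual))):
    print(f"cycle {cycle}: actual={real} evaluated={seen}")
```

In cycle 3 the process is running again (`actual=0`), yet the status evaluated in that cycle is still the failure from cycle 2, which matches the false alarm seen after the restart in the reproduction steps.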