
Broker crashed due unrecoverable error after "system time change" and "Compensating ..." #1350

Closed
pedrojimenez opened this issue Oct 24, 2014 · 18 comments


@pedrojimenez

We have our server (Shinken 2.0.3) running in a VMware virtual machine with VMware Time Sync disabled, syncing instead via the operating system's NTP client (Ubuntu 12.04). We get a "Compensating..." message every day and the broker daemon crashes.

This is the event:

[broker-master] A system time change of 1658.38716102 has been detected.  Compensating...

After a while the broker daemon wrote these entries to the log:

2014-10-24 00:37:42,018 [1414103860] Critical : [broker-master] I got an unrecoverable error. I have to exit
2014-10-24 00:37:43,941 [1414103863] Critical : [broker-master] You can log a bug ticket at https://github.com/naparuba/shinken/issues/new to get help
2014-10-24 00:37:44,666 [1414103864] Critical : [broker-master] Back trace of it: Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/shinken/daemons/brokerdaemon.py", line 728, in main
    self.do_mainloop()
  File "/usr/local/lib/python2.7/dist-packages/shinken/daemon.py", line 322, in do_mainloop
    self.do_loop_turn()
  File "/usr/local/lib/python2.7/dist-packages/shinken/daemons/brokerdaemon.py", line 633, in do_loop_turn
    q.put(to_send)
  File "<string>", line 2, in put
  File "/usr/lib/python2.7/multiprocessing/managers.py", line 759, in _callmethod
    kind, result = conn.recv()
EOFError

After the error, the broker daemon process still exists (its PID remains), but all communication to/from the broker is lost. The arbiter loops forever trying to reach the broker.

We have tested this behaviour both with and without VMware Tools; it happens in both cases.

@pedrojimenez pedrojimenez changed the title Broker crashed due unrecoverable error Broker crashed due unrecoverable error after "system time change" and "Compensating ..." Oct 24, 2014
@naparuba
Contributor

Is your system really losing time?


@pedrojimenez
Author

Hi @naparuba,
I don't think so; we sync every hour against an NTP server.
Even when doing the same sync manually, the time difference is really low (milliseconds):

root@:~#  ntpdate <ourNTPserver>
24 Oct 10:39:48 ntpdate[14039]: adjust time server ourNTPserver offset -0.003330 sec

@dgilm
Contributor

dgilm commented Oct 24, 2014

Broker modules loaded: webui,livestatus,ui-pnp,npcdmod,canopsis

We are going to disable the canopsis module and try, because I suspect there is a serious problem with the Shinken queue management in mod-canopsis. I would appreciate it if you could take a look at shinken-monitoring/mod-canopsis#9 in case these two issues are related.

@dgilm
Contributor

dgilm commented Oct 28, 2014

It seems that the problem is in mod-canopsis, as I guessed. No crashes since we disabled the module.

@pedrojimenez
Author

Update:
There have been no more problems after a week with this module disabled.
We are going to investigate this more deeply; at least there are no more daemon crashes.

We will keep the info updated.

Regards.

@gst
Contributor

gst commented Nov 5, 2014

Looking at the traceback:

File "/usr/local/lib/python2.7/dist-packages/shinken/daemons/brokerdaemon.py", line 633, in do_loop_turn
q.put(to_send)
File "", line 2, in put
File "/usr/lib/python2.7/multiprocessing/managers.py", line 759, in _callmethod
kind, result = conn.recv()
EOFError

it's "clear" to me that a broker module has "died" (its queue is "EOF" , meaning the queue has been closed by the remote side).

The question is how to react to that at this level. TBC.
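
For illustration, here is a minimal sketch of what guarding that put() in the broker loop could look like. This is purely hypothetical code, not the actual brokerdaemon.py: names such as get_external_instances(), to_q, get_name() and set_to_restart() are assumptions for the sake of the example.

# Hypothetical sketch: protect the broker loop against a module whose
# queue has been closed (e.g. because the module process died).
for mod in self.modules_manager.get_external_instances():
    try:
        mod.to_q.put(to_send)
    except EOFError:
        # EOFError here means the manager connection behind the queue
        # was closed: the module process is gone (crashed or killed).
        logger.error("[broker-master] Module %s queue is closed, "
                     "flagging the module for restart", mod.get_name())
        self.modules_manager.set_to_restart(mod)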

@gst
Contributor

gst commented Nov 5, 2014

Just about this: "After the error, the broker daemon still remains as pid"

What did you mean by "as pid"? Was the process still alive? (It should not be, IMO.)

Anyway, the broker could catch this error and try to relaunch the failed module, if feasible.
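
As a rough sketch of that relaunch idea (illustrative only; the process attribute and the stop_process()/start() methods are assumed names, not a confirmed module API):

# Hypothetical sketch of "catch the error and relaunch the failed module".
def restart_dead_external_modules(modules_manager, logger):
    for mod in modules_manager.get_external_instances():
        proc = getattr(mod, 'process', None)
        if proc is not None and not proc.is_alive():
            logger.warning("[broker-master] External module %s died "
                           "(exit code %s), relaunching it",
                           mod.get_name(), proc.exitcode)
            mod.stop_process()   # clean up the dead process and its queues
            mod.start()          # start a fresh process with new queues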

@dgilm
Contributor

dgilm commented Nov 5, 2014

I think @pedrojimenez didn't mention to you in the issue that there is an "out of memory" problem between the "system time change" and the subsequent Python exception.

When this problem occurs, the broker consumes more and more memory until the system runs out of RAM and swap. At that point, I think the Linux kernel closes the queue to protect the system. When the exception is raised, the memory and swap usage is released. In any case, you need to restart the broker in order to get a fully functional environment again.

We have seen and verified this behaviour plenty of times (once per night) using a Munin instance installed on the system. The generated RRD graphs are self-explanatory: the time of the exception is the same time the out-of-memory event occurs, followed by the memory release.

That's my guess, I hope it can help.

@dgilm
Contributor

dgilm commented Nov 5, 2014

Yes, the broker process doesn't die, but it doesn't write logs and doesn't communicate with other modules. It's like a zombie process that you need to restart if you want it to work again.

@gst
Contributor

gst commented Nov 5, 2014

OK, I think I see it more clearly now.

Anyway: if the broker process didn't die/exit after the exception, then more than probably some of its modules run in threads that keep the process alive. I'm quite sure this shouldn't be the case though: we want the broker process to fully exit/terminate (and all its modules!) when the main thread exits (normally or not).

If I find some time, I'll look at the modules that were used here.

"The RRD graphs generated are self-explanatory: the time of the exception is the same time the out-of-memory event occurs"

The out-of-memory condition could well have killed the canopsis module (or another one) -> as you say, its queue is then closed by the kernel -> EOFError when the broker tries to send/write to it.

The side question is: what was causing the out-of-memory in the first place?

@dgilm
Contributor

dgilm commented Nov 6, 2014

The "Compensating system time" seems to be the trigger and the shinken-monitoring/mod-canopsis#9 indicates a problem with the queue management. If you want us to try something or need more info, please tell us.

@dguenault
Collaborator

Well, the retention-saving part is based on a 1-second tick, so there is a tick increment every second; if the tick count exceeds queue_dump_frequency, retention is enabled and there is queued data, then the queue is saved to disk (the canopsis.dat file).

Are you sure the canopsis module connects to the Canopsis AMQP daemon to write events? Search the logs for messages like:

[Canopsis] Unexpected error
[Canopsis] saving to
[Canopsis] Not connected, going to queue messages until connection back
[Canopsis] Maximum retention for event queue
Couldn't publish message  <= we need to prepend [Canopsis] to this one ...

If you find some of those messages, then you have a problem connecting and/or publishing messages to Canopsis. Review your module configuration and try to tune the retention queue size and dump frequency (with lower values for the retention size). The queue is kept in memory unless it is flushed (50000 events is the default value).
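
For reference, a minimal sketch of that tick-based dump logic, assuming a simplified module class: apart from queue_dump_frequency and the canopsis.dat file name mentioned above, all names and values here are made up for illustration and are not the real mod-canopsis code.

# Sketch of the 1-second-tick retention dump described above.
import cPickle as pickle  # Python 2.7, as used by this Shinken install

class RetentionQueue(object):
    def __init__(self, queue_dump_frequency=300,
                 dump_path='/var/lib/shinken/canopsis.dat'):
        self.queue_dump_frequency = queue_dump_frequency
        self.dump_path = dump_path
        self.ticks = 0
        self.events = []  # kept in memory until flushed to disk

    def hook_tick(self):
        # Called once per second by the broker.
        self.ticks += 1
        if self.ticks >= self.queue_dump_frequency and self.events:
            with open(self.dump_path, 'wb') as f:
                pickle.dump(self.events, f)  # save queued events to disk
            self.events = []                 # release the memory
            self.ticks = 0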


@dgilm
Contributor

dgilm commented Nov 6, 2014

Hi @dguenault,

Yes, I'm pretty sure the canopsis module is connecting to the Canopsis AMQP daemon and writing events (service events by default, host events after applying the shinken-monitoring/mod-canopsis#9 patch).

@david-guenault
Collaborator

Did you notice any of those messages in the logs?


@dgilm
Contributor

dgilm commented Nov 6, 2014

I have noticed those log messages in the past due to communication problems, but not at the moment of the "system time change" error.

In addition, the broker log is full of these messages (generated by the try/except in the pull request):

Asking reinit: from_q queue is not initialized

@dguenault
Collaborator

Mmmh, I need to build a lab to test all of this. The code is not really clear on the reinit.


@naparuba naparuba added Bug and removed Feature labels Nov 12, 2014
@naparuba naparuba added this to the 2.2 (Maleable Muskrat) **doc, cli & arbiters** milestone Nov 12, 2014
@naparuba
Contributor

A fix has been proposed; testing is in progress to check for regressions, but we will only be sure it works well in the long run.

@pedrojimenez
Author

Thanks @naparuba,
We are applying it and activating mod-canopsis again.

If there is any problem we will post it on this thread.
