
Broker crashed due unrecoverable error after "system time change" and "Compensating ..." #1350

Closed
pedrojimenez opened this issue Oct 24, 2014 · 18 comments


@pedrojimenez

We have our server (Shinken 2.0.3) running in a VMware virtual machine with VMware Time Sync disabled, syncing instead via the operating system's NTP client (Ubuntu 12.04). We get a "Compensating..." message every day and the broker daemon crashes.

This is the event:

[broker-master] A system time change of 1658.38716102 has been detected.  Compensating...

After a while the broker daemon wrote these entries to the log:

2014-10-24 00:37:42,018 [1414103860] Critical : [broker-master] I got an unrecoverable error. I have to exit
2014-10-24 00:37:43,941 [1414103863] Critical : [broker-master] You can log a bug ticket at https://github.com/naparuba/shinken/issues/new to get help
2014-10-24 00:37:44,666 [1414103864] Critical : [broker-master] Back trace of it: Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/shinken/daemons/brokerdaemon.py", line 728, in main
    self.do_mainloop()
  File "/usr/local/lib/python2.7/dist-packages/shinken/daemon.py", line 322, in do_mainloop
    self.do_loop_turn()
  File "/usr/local/lib/python2.7/dist-packages/shinken/daemons/brokerdaemon.py", line 633, in do_loop_turn
    q.put(to_send)
  File "<string>", line 2, in put
  File "/usr/lib/python2.7/multiprocessing/managers.py", line 759, in _callmethod
    kind, result = conn.recv()
EOFError

After the error, the broker daemon process still exists (its PID remains), but all communication to/from the broker is lost. The arbiter loops forever trying to reach the broker.

We have tested this behaviour both with and without VMware Tools; it happens in both cases.

@pedrojimenez pedrojimenez changed the title Broker crashed due unrecoverable error Broker crashed due unrecoverable error after "system time change" and "Compensating ..." Oct 24, 2014
@naparuba
Contributor

Is your system really losing time?


@pedrojimenez
Author

Hi @naparuba,
I don't think so; we sync every hour against an NTP server.
Even when doing the same sync manually, the time difference is really low (milliseconds):

root@:~#  ntpdate <ourNTPserver>
24 Oct 10:39:48 ntpdate[14039]: adjust time server ourNTPserver offset -0.003330 sec

@dgilm
Contributor

dgilm commented Oct 24, 2014

Broker modules loaded: webui,livestatus,ui-pnp,npcdmod,canopsis

We are going to disable the canopsis module and try, because I suspect there is a serious problem with the Shinken queue management in mod-canopsis. I would appreciate it if you could take a look at shinken-monitoring/mod-canopsis#9 in case these two issues are related.

@dgilm
Contributor

dgilm commented Oct 28, 2014

It seems that the problem is in mod-canopsis, as I guessed. No crashes since we disabled the module.

@pedrojimenez
Author

Update:
There have been no more problems after a week with this module disabled.
We are going to investigate this more deeply; at least there are no more daemon crashes.

We will keep the info updated.

Regards.

@gst
Contributor

gst commented Nov 5, 2014

Looking at the traceback:

File "/usr/local/lib/python2.7/dist-packages/shinken/daemons/brokerdaemon.py", line 633, in do_loop_turn
q.put(to_send)
File "", line 2, in put
File "/usr/lib/python2.7/multiprocessing/managers.py", line 759, in _callmethod
kind, result = conn.recv()
EOFError

it's "clear" to me that a broker module has "died" (its queue is "EOF" , meaning the queue has been closed by the remote side).

The question is how to react to that at this level. TBC.
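
For illustration, here is a minimal sketch of what guarding that put() in the broker loop could look like. This is purely hypothetical code, not the actual brokerdaemon.py: names such as get_external_instances(), to_q, get_name() and set_to_restart() are assumptions for the sake of the example.

# Hypothetical sketch: protect the broker loop against a module whose
# queue has been closed (e.g. because the module process died).
for mod in self.modules_manager.get_external_instances():
    try:
        mod.to_q.put(to_send)
    except EOFError:
        # EOFError here means the manager connection behind the queue
        # was closed: the module process is gone (crashed or killed).
        logger.error("[broker-master] Module %s queue is closed, "
                     "flagging the module for restart", mod.get_name())
        self.modules_manager.set_to_restart(mod)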

@gst
Contributor

gst commented Nov 5, 2014

Just about this: "After the error, the broker daemon still remains as pid"

What did you mean by "as pid"? Was the process still alive? (It should not be, IMO.)

Anyway, the broker could catch this error and try to relaunch the failed module, if feasible.
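
As a rough sketch of that relaunch idea (illustrative only; the process attribute and the stop_process()/start() methods are assumed names, not a confirmed module API):

# Hypothetical sketch of "catch the error and relaunch the failed module".
def restart_dead_external_modules(modules_manager, logger):
    for mod in modules_manager.get_external_instances():
        proc = getattr(mod, 'process', None)
        if proc is not None and not proc.is_alive():
            logger.warning("[broker-master] External module %s died "
                           "(exit code %s), relaunching it",
                           mod.get_name(), proc.exitcode)
            mod.stop_process()   # clean up the dead process and its queues
            mod.start()          # start a fresh process with new queues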

@dgilm
Contributor

dgilm commented Nov 5, 2014

I think @pedrojimenez didn't mention to you in the issue that there is an "out of memory" problem between the "system time change" and the subsequent Python exception.

When this problem occurs, the broker consumes more and more memory until the system runs out of RAM and swap. At that point, I think the Linux kernel closes the queue to protect the system. When the exception is raised, the memory and swap usage is released. In any case, you need to restart the broker in order to get a fully functional environment again.

We have seen and verified this behaviour plenty of times (once per night) using a Munin instance installed on the system. The generated RRD graphs are self-explanatory: the time of the exception is the same time the out-of-memory event occurs, followed by the memory release.

That's my guess, I hope it can help.

@dgilm
Contributor

dgilm commented Nov 5, 2014

Yes, the broker process doesn't die, but it doesn't write logs and doesn't communicate with other modules. It's like a zombie process that you need to restart if you want it to work again.

@gst
Contributor

gst commented Nov 5, 2014

OK, I think I see it more clearly now.

Anyway: if the broker process didn't die/exit after the exception, then more than probably some of its modules run in threads that keep the process alive. I'm quite sure this shouldn't be the case though: we want the broker process to fully exit/terminate (and all its modules!) when the main thread exits (normally or not).

If I find some time, I'll look at the modules that were used here.

"The RRD graphs generated are self-explanatory: the time of the exception is the same time the out-of-memory event occurs"

The out-of-memory condition could well have killed the canopsis module (or another one) -> as you say, its queue is then closed by the kernel -> EOFError when the broker tries to send/write to it.

The side question is: what was causing the out-of-memory in the first place?

@dgilm
Contributor

dgilm commented Nov 6, 2014

The "Compensating system time" seems to be the trigger and the shinken-monitoring/mod-canopsis#9 indicates a problem with the queue management. If you want us to try something or need more info, please tell us.

@dguenault
Collaborator

Well, the retention-saving part is based on a 1-second tick, so there is a tick increment every second; if the tick count exceeds queue_dump_frequency, retention is enabled and there is queued data, then the queue is saved to disk (the canopsis.dat file).

Are you sure the canopsis module connects to the Canopsis AMQP daemon to write events? Search the logs for messages like:

[Canopsis] Unexpected error
[Canopsis] saving to
[Canopsis] Not connected, going to queue messages until connection back
[Canopsis] Maximum retention for event queue
Couldn't publish message  <= we need to prepend [Canopsis] to this one ...

If you find some of those messages, then you have a problem connecting and/or publishing messages to Canopsis. Review your module configuration and try to tune the retention queue size and dump frequency (with lower values for the retention size). The queue is kept in memory unless it is flushed (50000 events is the default value).
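
For reference, a minimal sketch of that tick-based dump logic, assuming a simplified module class: apart from queue_dump_frequency and the canopsis.dat file name mentioned above, all names and values here are made up for illustration and are not the real mod-canopsis code.

# Sketch of the 1-second-tick retention dump described above.
import cPickle as pickle  # Python 2.7, as used by this Shinken install

class RetentionQueue(object):
    def __init__(self, queue_dump_frequency=300,
                 dump_path='/var/lib/shinken/canopsis.dat'):
        self.queue_dump_frequency = queue_dump_frequency
        self.dump_path = dump_path
        self.ticks = 0
        self.events = []  # kept in memory until flushed to disk

    def hook_tick(self):
        # Called once per second by the broker.
        self.ticks += 1
        if self.ticks >= self.queue_dump_frequency and self.events:
            with open(self.dump_path, 'wb') as f:
                pickle.dump(self.events, f)  # save queued events to disk
            self.events = []                 # release the memory
            self.ticks = 0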


@dgilm
Contributor

dgilm commented Nov 6, 2014

Hi @dguenault,

Yes, I'm pretty sure the canopsis module is connecting to the Canopsis AMQP daemon and writing events (service events by default, host events after applying the shinken-monitoring/mod-canopsis#9 patch).

@david-guenault
Collaborator

Did you notice any of those messages in the logs?


@dgilm
Contributor

dgilm commented Nov 6, 2014

I have noticed those log messages in the past due to communication problems, but not at the moment of the "system time change" error.

In addition, the broker log is full of these messages (generated by the try/except in the pull request):

Asking reinit: from_q queue is not initialized

@dguenault
Collaborator

Mmmh, I need to build a lab to test all of this. The code is not really clear on the reinit.


@naparuba naparuba added Bug and removed Feature labels Nov 12, 2014
@naparuba naparuba added this to the 2.2 (Maleable Muskrat) **doc, cli & arbiters** milestone Nov 12, 2014
@naparuba
Contributor

A fix has been proposed; testing is in progress to check for regressions, but we will only be sure it works well in the long run.

@pedrojimenez
Author

Thanks @naparuba,
We are applying it and activating mod-canopsis again.

If there is any problem we will post it on this thread.
