Broker crashed due unrecoverable error after "system time change" and "Compensating ..." #1350
Comments
Is your system really losing time? (In reply to Pedro Jiménez Solís, Fri, Oct 24, 2014 at 9:16 AM.)
Hi @naparuba,
Broker modules loaded: webui,livestatus,ui-pnp,npcdmod,canopsis. We are going to disable the canopsis module and try again, because I suspect there is a serious problem with the Shinken queue management in mod-canopsis. I would appreciate it if you could take a look at shinken-monitoring/mod-canopsis#9 in case the two issues are related.
It seems that the problem is in mod-canopsis, as I guessed. No crashes since we disabled the module.
Update: We will keep the info updated. Regards.
Looking at the traceback: File "/usr/local/lib/python2.7/dist-packages/shinken/daemons/brokerdaemon.py", line 633, in do_loop_turn. It's "clear" to me that a broker module has "died" (its queue is at "EOF", meaning the queue has been closed by the remote side). The question is how to react to that at this level. TBC.
Just about "After the error, the broker daemon still remains as pid": what did you mean by "as pid"? Was the process still alive? (It should not be, IMO.) Anyway, the broker could indeed catch this error and try to relaunch the failed module, if feasible.
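The idea of catching the error and relaunching the failed module could be sketched roughly as follows. This is only an illustration, not Shinken's actual code: `ModuleHandle`, `poll_module`, and the `restart` callback are hypothetical names, and a `multiprocessing.Pipe` stands in for the broker-to-module queue.

```python
import multiprocessing


class ModuleHandle(object):
    """Hypothetical stand-in for a broker module and its pipe to the broker."""

    def __init__(self, name):
        self.name = name
        # broker_end stays in the broker; module_end would live in the module.
        self.broker_end, self.module_end = multiprocessing.Pipe()


def poll_module(handle, restart):
    """Drain pending messages; if the pipe hit EOF, ask `restart` for a new handle.

    Returns (handle, messages). The returned handle is a replacement
    when the remote side was found closed.
    """
    messages = []
    try:
        while handle.broker_end.poll():
            messages.append(handle.broker_end.recv())
    except (EOFError, OSError):
        # The remote side closed the pipe, so the module is gone.
        handle.broker_end.close()
        handle = restart(handle.name)
    return handle, messages
```

A caller would pass a `restart` function that relaunches the module and returns a fresh handle. Whether relaunching is safe depends on why the module died, which this sketch does not address.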
I think @pedrojimenez didn't mention in the issue that there is an "out of memory" problem between the "system time change" and the consequent Python exception. When this problem occurs, the broker consumes more and more memory until the system runs out of RAM and swap. At that moment, I think the Linux kernel is responsible for closing the queue to protect the system. When the exception is raised, the memory and swap are released. Even so, you need to restart the broker to get a fully functional environment. We have seen and verified this behaviour plenty of times (once per night) by checking a Munin instance installed on the system. The RRD graphs are self-explanatory: the exception occurs at the same time as the out-of-memory event, followed by the memory release. That's my guess; I hope it helps.
Yes, the broker process doesn't die, but it doesn't write logs and doesn't communicate with other modules. It's like a zombie process that you need to restart if you want it to work again.
OK, I (think I) see more clearly. Anyway: if the broker process did not die/exit after the exception, then most probably some of its modules run as non-daemon threads. I'm quite sure this shouldn't be the case, though: we want the broker process (and all its modules!) to be fully terminated when the main thread exits, normally or not. If I find some time, I'll look into the modules that were in use.
The out-of-memory condition could well have killed the canopsis module (or another one); as you say, its queue is then closed by the kernel, hence the EOF when the broker tries to send/write to it. The side question is: what was causing the out-of-memory?
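As a point of reference for the thread discussion: in CPython, a non-daemon thread keeps the process alive after the main thread fails, while a daemon thread is abandoned when the interpreter exits. The sketch below demonstrates this with a throwaway child script. `CHILD` and `worker_outlives_main_crash` are hypothetical names, and nothing here is taken from Shinken.

```python
import subprocess
import sys

# Child script: start one worker thread, then crash the main thread.
CHILD = """
import sys, threading, time

def worker():
    time.sleep(0.5)
    print("worker done")

t = threading.Thread(target=worker)
t.daemon = (sys.argv[1] == "daemon")
t.start()
raise RuntimeError("main thread failed")
"""


def worker_outlives_main_crash(kind):
    """Run CHILD with kind 'daemon' or 'non-daemon'.

    Returns True if the worker thread still finished its work after
    the main thread raised.
    """
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD, kind],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    out, _ = proc.communicate()
    return b"worker done" in out
```

Calling `worker_outlives_main_crash("non-daemon")` shows the process lingering until the worker ends, which resembles the half-alive broker described above. Modules running as separate processes would need their own cleanup.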
The "Compensating system time" message seems to be the trigger, and shinken-monitoring/mod-canopsis#9 indicates a problem with the queue management. If you want us to try something or need more info, please tell us.
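For readers unfamiliar with this kind of message, a tick-based daemon can notice a wall-clock jump by comparing the time between two ticks with the expected interval. The sketch below is an assumption-laden illustration, not Shinken's implementation: the function names, the 1-second interval, and the 900-second threshold are all made up for the example.

```python
def detect_time_jump(last_tick, now, tick_interval=1.0, threshold=900.0):
    """Return the apparent clock offset in seconds, or 0.0 if within threshold.

    `last_tick` and `now` are wall-clock timestamps (seconds since epoch).
    The defaults are illustrative assumptions only.
    """
    drift = (now - last_tick) - tick_interval
    if abs(drift) > threshold:
        return drift
    return 0.0


def compensate(timestamps, offset):
    """Shift stored timestamps by the detected offset."""
    return [t + offset for t in timestamps]
```

For example, if the clock is set back one hour between two ticks, `detect_time_jump(1000.0, -2599.0)` returns `-3600.0`, and scheduled timestamps can be shifted by the same amount. Measuring intervals with a monotonic clock would avoid false alarms from the wall clock itself, at the cost of not detecting the jump at all.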
Well, the retention saving part is based on a 1-second tick, so there is a tick. Are you sure the canopsis module connects to the canopsis AMQP daemon to write? If you find some of those messages, then you have a problem connecting.
Hi @dguenault, yes, I'm pretty sure that the canopsis module is connecting to canopsis AMQP and writing events (service events by default, host events when applying the shinken-monitoring/mod-canopsis#9 patch).
Did you notice any of the messages in the logs? (In reply to David Gil, 2014-11-06 8:54 GMT+01:00.)
I have noticed those logs in the past due to communication problems, but not at the moment of the "system change" error. In addition, the broker log is full of these messages (generated by the try/except in the pull request).
Hmm, I need to build a lab to test all of this. The code is not really clear. (In reply to David Gil, 2014-11-06 9:11 GMT+01:00.)
Fix proposed; regression tests are in progress, but we will only be sure it works well in the long run.
Thanks @naparuba. If there is any problem, we will post it on this thread.
We have our server (Shinken 2.0.3) running in a VMware virtual machine with VMware Time Sync disabled; time is synchronized by the operating system's NTP client (Ubuntu 12.04). We get a "Compensating" message every day and the broker daemon crashes.
This is the event:
After a while the broker daemon threw these entries in the log:
After the error, the broker daemon still remains as pid, but all communication from/to the broker is lost. The arbiter loops forever trying to find the broker.
We have tried to test this behaviour with/without VMwareTools. It happens in both cases.