Logrotate issues since 2.10.1-1 #6737
Comments
A description of the circumstances to reproduce this is required. You say it happens during logrotate; what makes you think that? The logs from around this time would also be very welcome.
We've got a stock Debian cron and logrotate setup with around 200 hosts, and randomly one or two of those hosts have this issue. Logrotate is in cron.daily, which gets executed at 06:25:

25 6 * * * root test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily )

/var/log/icinga2# ls -l

The crash report is from that time:

stat /var/log/icinga2/crash/report.1540790715.016844
File: /var/log/icinga2/crash/report.1540790715.016844

And the last entry of the log file is:

[2018-10-29 06:23:17 +0100] information/WorkQueue: #14 (JsonRpcConnection, #5) items: 0, rate: 0/s (0/min 0/5min 0/15min);

Do you need any further information?
I think I got the same problem on a machine I host.
Last log lines:
It happens randomly on another machine, just like @sschindel, but the stacktrace is not exactly the same... I can provide more info if it helps. Oh, and about the configuration: nothing has been manually changed, only launched a
Thank you very much.
Edit: high probability this is a duplicate of #6736
Are you using SysVinit or systemd here? I'm thinking about a relationship to #6356.
If it is just Debian 9, #6571 could come into play as well.
@dnsmichi: Standard Debian Stretch install, so systemd :-) Hope it helps.
Hi, we are also using Debian Stretch and had no issues with 2.9.2-1. This issue only appeared after upgrading to 2.10.1-1. We are using the packages from packages.icinga.com:

deb http://packages.icinga.com/debian icinga-stretch main
I'm still using the stable deb sources.list, so I'm still running the sometimes-crashing-on-my-VM version. It did not happen for a week, but then it happened again today when

We host about 40 VMs and hosts (35 Debian Stretch (systemd), 1 Debian Jessie (systemd), the rest is Ubuntu (systemd)), and only 1 is posing problems. It crashed at exactly the same time as previously (6:25 AM CET, that is to say 5:25 AM UTC; some machines aren't in the same timezone, but I've taken care to pick the correct log lines). Here is the crash report:
Here are the last
Here is the
Here is the
Hope it helps. To the dev staff: I'm here to help/test things if needed ;-)
Edit: added
This looks like a wrong file descriptor handle, i.e. logrotate moves something away that Icinga still wants to write to. I suspect the close-stdio changes and the flushing of the streams cause trouble here. If anyone has time to look up the code changes: I'm out for OSMC this week.
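For illustration, here is a minimal, self-contained sketch of the stale-descriptor behaviour suspected above. It is not Icinga's logging code; the file name is just a stand-in in the current directory. It only shows that, on Linux, once logrotate renames a log file, writes through an already-open stream keep landing in the rotated file until the path is reopened:

```cpp
#include <cstdio>
#include <fstream>

int main() {
    // Open the log once, the way a long-running daemon would.
    std::ofstream log("icinga2.log");          // hypothetical file name
    log << "before rotation\n" << std::flush;

    // Simulate what logrotate does in its default "create" mode: move the file away.
    std::rename("icinga2.log", "icinga2.log.1");

    // The stream still points at the old inode, so this line ends up in
    // icinga2.log.1, while a freshly created icinga2.log would stay empty ...
    log << "after rotation, still going to the old inode\n" << std::flush;

    // ... until the logger explicitly reopens the path (what a postrotate
    // signal or a daemon restart is supposed to trigger).
    log.close();
    log.open("icinga2.log", std::ios::app);
    log << "after reopen, written to the new file\n";
}
```

This sketch only covers the "writes go to the wrong file" half of the hypothesis; the crash itself is analysed further down in the thread.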
Hi, I also have this issue, please see my crash log as well. This also includes the gdb output:
After the crash a new icinga2.log is already created:
But as you can see, nothing is written to it yet. When I start the Icinga2 daemon again after the crash, it starts writing to
Can you try the snapshot packages? In #6736 a bug was fixed that seems similar to this.
Hello @mcktr,
@mcktr The changes you are referring to are solely cluster related; nothing there points to the stream logger class. I've fixed a possible double free, as seen in the gdb stacktrace, yesterday; please test the snapshot packages once they are built. On a related note, right now master solely contains bugfixes and no radical changes, so it is safe to assume it works.
Thus far, I am not able to reproduce this with 2.10.1 on Debian Stretch.
Has this already been released? There was an update a few days ago, but it still does not fix the issue.
@sschindel Same here, it crashed this morning as well.
Please add more logs from such crashes and anything else that helps reproduce the issue. I haven't been able to reproduce this yet.
Today's logs File
File
File
File
File
File
Does 2.10.2 change anything here?
@dnsmichi: It doesn't seem so, because it crashed again this morning. How to reproduce: good question. I installed Debian Stretch and followed the GrayLog setup tutorial.
Thanks for your thoughts and input. I can only offer you to test what is already available; everything else is work in progress. I tend to disagree on "should be simple to fix": the analysis in this issue has proven that the problem exists, but it is not an easy one to reproduce. Since you're experiencing this in your environment, and obviously have some deeper knowledge about C++ and signal handling, it might be a good effort to patch this and send in a PR with an analysis before and after.

I'm not sure how the UTF8 things relate here; that's a different issue which has been fixed with the JSON library replacement. The main thing is that memory corruption and leaks are hard to find, and at the time this crashes they may have had an influence, but the root cause lies in a different scope. I don't believe that this is the true cause for the logrotate crashes either, but testing without the removed YAJL problem allows us to narrow the problem down further. The stacktrace on a crash will definitely look different, and hopefully gives more insights. Therefore I am asking you and everyone else experiencing the problem to test the snapshot packages.

If we don't find anything here in a reasonable amount of time, there are two options: a) block the 2.11 release, which means making everyone wait who is waiting for the network stack etc., or b) remove this from 2.11 and leave it for later, which is bad for everyone here.

Cheers,
Regarding not wanting to test this: it's just not really testable for us, because the last one, aborting in YAJL, is not as common in our environment as the other ones. And if it is a separate issue, it thus had a lower priority.

For the crashes that are reported in this issue I did a verification that they are related to unsafe refcounting. And I made one clean fix/patch, changing as little as possible and just fixing the issue, with only a small "interface" change: the OnTimerExpired callback does not receive a Ptr object holding a reference but just a plain pointer, and thus one has to be careful not to increment its refcount or keep a copy aside for later use without knowing the object is still alive then; but this parameter to OnTimerExpired is used nowhere besides the test case.

One could prefer a bigger rework that keeps the timer scheduler holding references to its Timers, so as not to have the interface change described above, but that's not as simple. Sure, I could do that if it were your requirement for a fix, but it would be an effort I would only spend if really needed. Sure, I can put it in a PR with the analysis before and after, the one I have. I could also try to prove whether the YAJL abort is reproducible by synchronizing that part like the other one I already synchronized for the verification, in such a way that the malloc from YAJL occurs after one of the parties has finished destruction completely (freed the memory) but before the other one does the same and raises the double-free SEGV (so the party that is not the first destructor waits for a notification from YAJL).

So what's your requirement? Just a PR from what I already did? Does it need to verify whether the YAJL error is covered? Or is it required to provide the OnTimerExpired callback with a Ptr object with a valid refcount while at the same time keeping the auto-destruction behaviour of the Timer objects, which would make the fix much more complex?

I don't have a deep knowledge of C++, and signal handling is also nothing I'm really deep into. But this is not really related to signal handling so far, and I have a quite deep understanding of programming/threading logic, which is the same across all languages behind the curtains for the hardware we are talking about (we are not talking about quantum computers and the like, where I would expect some major differences). For the other issues I can comment there (a bit shorter).
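To make the refcounting race described above concrete, here is a deliberately simplified, self-contained sketch. The RefCounted/AddRef/Release names are made up for illustration and only stand in for intrusive refcounting in general, not Icinga's actual classes; the last two calls intentionally reproduce the broken interleaving, so running this is undefined behaviour by design.

```cpp
#include <atomic>
#include <cstdio>

// Hypothetical stand-in for an intrusively ref-counted object such as a Timer.
struct RefCounted {
    std::atomic<int> refs{1};   // starts owned by whoever created it
};

void AddRef(RefCounted* p) { p->refs.fetch_add(1); }

void Release(RefCounted* p) {
    if (p->refs.fetch_sub(1) == 1)
        delete p;               // last reference gone, object is destroyed
}

int main() {
    RefCounted* timer = new RefCounted();   // refs == 1, held by the "owner"

    // The scheduler only remembers the raw pointer, without a reference of its own.
    RefCounted* rawInQueue = timer;

    // Interleaving that matches the reported crashes:
    // thread A (the owner) drops the last reference first ...
    Release(timer);             // refs 1 -> 0, delete runs

    // ... then thread B (the expired-timer callback) re-wraps the raw pointer
    // into a new counted reference. That is a use-after-free, and the matching
    // Release() frees the object a second time: the observed SIGSEGV/abort.
    AddRef(rawInQueue);
    Release(rawInQueue);

    std::puts("reaching this line is pure luck (undefined behaviour above)");
}
```

The fix described above avoids exactly this by never re-creating a counted reference from the raw pointer inside the callback.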
Review done: #7130 does not change anything and does not address the root cause. The root cause is fixed with #7129. If you want a workaround like #7130 that does not address the root cause, I posted something that works instead in the comment above. As I'm not a fan of leaving the root cause unfixed and just implementing a local workaround, I will not make a PR for that. But you could indeed merge what is shown there (it's already tested; for the problem observed within this issue the second change would be the least that is needed, but then one more place stays unfixed...).
Hi, thanks for the critical section patch to force the application crash; it really helped in analysing all the possible patches. Here's what we've tested today.
Nothing
Git Master, 02db12a. Result: Segfault.

Only initialize Flush Timer once
From #7130. Result: Icinga doesn't crash, but the log flush is running into a lock, enforced by the CS patch and different circumstances.

Timer const*
From #7129. Result: Icinga doesn't crash and continues to log on stdout and mainlog.

Combine flush timer once with timer const*
Result: Works fine, doesn't have any other influences.

Conclusion, first part
Using a const pointer instead of an intrusive ptr fixes the problem, still the flush timer "re-newal" is totally unnecessary and needs to be fixed as well.

lambda function instead of std::bind
From #7004. Result: Doesn't fix, crashes too.

lambda function and flush
Result: Doesn't crash.

Conclusion, second part
Somehow std::bind is involved here with intrusive_ptr, starting with the changes in 2.9. A const* pointer or using a lambda bind effectively keeps the pointer reference, and the call continues; no application crashes involved.
After some research within #7006 I've come to the following conclusion:
You can avoid the intrusive_ptr going out of scope by either making this a const raw pointer, or by passing its reference into a lambda function. Since the Timer class explicitly uses raw pointers instead of intrusive_ptrs, I'd stay there and combine the three patches with 1) const Timer*, 2) lambdas, 3) avoiding re-newing the timer. In terms of moving away from std::bind to lambdas, follow my research here: #7006 (comment)
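As a rough sketch of those two options, and not Icinga's actual Timer/signal API (the Timer, ExpiredHandler, and keepAlive names here are made up), something along these lines keeps refcount traffic out of the firing path while the handler itself keeps the object alive:

```cpp
#include <boost/intrusive_ptr.hpp>
#include <boost/smart_ptr/intrusive_ref_counter.hpp>
#include <functional>
#include <iostream>
#include <vector>

// Hypothetical stand-in for Icinga's Timer; only the refcounting matters here.
class Timer : public boost::intrusive_ref_counter<Timer> {
public:
    using Ptr = boost::intrusive_ptr<Timer>;
};

// Option 1 (const raw pointer): the fired timer is passed as const Timer*,
// so emitting the signal never copies an intrusive_ptr and never touches the
// refcount of an object that may already be mid-destruction.
using ExpiredHandler = std::function<void(const Timer*)>;

int main() {
    Timer::Ptr flushTimer(new Timer());
    std::vector<ExpiredHandler> onExpired;

    // Option 2 (lambda instead of std::bind): capturing the Ptr by value keeps
    // the timer alive for as long as this handler stays registered.
    onExpired.push_back([keepAlive = flushTimer](const Timer* which) {
        std::cout << "timer " << which << " expired, flushing log stream\n";
    });

    for (const auto& handler : onExpired)
        handler(flushTimer.get());   // fire: raw pointer only, no refcount traffic
}
```

Either variant matches the test results above: nothing increments or decrements the refcount of a possibly-dying object at firing time, and the captured Ptr holds a stable reference for the lifetime of the handler.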
…ashes at logrotate time

This is a direct fix of the issue, revealing the problem that leads to the crash. Verification was done with a patched icinga2 where the execution order of the code lines of the counterparts involved in re-incrementing/decrementing the Timer::Ptr is forced to be the one that leads to the observed segfaults.

refs Icinga#6737
Snapshot packages for Debian/Ubuntu need to be fixed in our build infra; the rest is available.
ref/IP/13853
…ashes at logrotate time

This is a direct fix of the issue, revealing the problem that leads to the crash. Verification was done with a patched icinga2 where the execution order of the code lines of the counterparts involved in re-incrementing/decrementing the Timer::Ptr is forced to be the one that leads to the observed segfaults.

refs #6737 (cherry picked from commit 52e3db2)
Hey!
Since upgrading to 2.10.1-1 we have been facing some issues with log rotation. The daemon sometimes crashes after the logrotate run; this did not happen before. This is the crash report:
cat /var/log/icinga2/crash/report.1540790715.016844
Application version: r2.10.1-1
System information:
Platform: Debian GNU/Linux
Platform version: 9 (stretch)
Kernel: Linux
Kernel version: 4.9.0-6-amd64
Architecture: x86_64
Build information:
Compiler: GNU 6.3.0
Build host: 1880c6977367
Application information:
General paths:
Config directory: /etc/icinga2
Data directory: /var/lib/icinga2
Log directory: /var/log/icinga2
Cache directory: /var/cache/icinga2
Spool directory: /var/spool/icinga2
Run directory: /run/icinga2
Old paths (deprecated):
Installation root: /usr
Sysconf directory: /etc
Run directory (base): /run
Local state directory: /var
Internal paths:
Package data directory: /usr/share/icinga2
State path: /var/lib/icinga2/icinga2.state
Modified attributes path: /var/lib/icinga2/modified-attributes.conf
Objects path: /var/cache/icinga2/icinga2.debug
Vars path: /var/cache/icinga2/icinga2.vars
PID path: /run/icinga2/icinga2.pid
Stacktrace:
Failed to launch GDB: No such file or directory