Frequent SIGSEGV with Fluent Bit 1.8.7 #4164
I managed to capture a core dump. Fluent Bit reported:
And
|
Hi Lionel, we'll get in touch with you soon to troubleshoot this. Thanks for the report! |
It seems that the problem is not specific to 1.8.7. After downgrading to 1.8.6 I got:
|
We just experienced this in production after deploying an update which otherwise passed pre-production. It seems like load and/or message format might play a part in this. We are still investigating but if anything obvious comes out of our investigation I'll try to give more details here. |
While I was testing different versions because of another bug, the earliest version where I saw SIGSEGV was 1.8.2, after pressing Ctrl-C. Unfortunately I only have the following small log snippet saved. This happened on Windows Server 2019.
If log format has something to do with this, I might be able to provide a redacted log example, if necessary. |
It looks like there is memory corruption, which is why the stack traces might look different. Do you have access to the specific file that was being processed when the crash occurred? |
We wondered if there was some issue that was causing memory exhaustion. In our case, we noticed that the logs were not ending up in Splunk. In this case, we wonder if the logs were buffered and eventually exceeded available memory. |
Looking at a recent crash, I see 188 leftover files in the On that machine here is the
We see Looking at another recent crash, I see a different tag corresponding to a different file being processed. |
In case it helps, here is the second backtrace:
Tag |
FWIW, here is another backtrace, almost identical to the previous one but with yet another file (this time Note that the chunk is NULL in
|
In case it helps to find the culprit, I've noticed that SELinux complains about
|
FWIW, I have downgraded once more (this time to 1.8.5) and I still see these crashes. So the problem is present at least in 1.8.5, 1.8.6 and 1.8.7... |
#4164 (comment) is similar to aws/aws-for-fluent-bit#255 (comment) and #4137. The patch #4187 may fix this issue. It is included as of v1.8.9.
|
Maybe this indeed fixes this problem but I cannot test 1.8.9 on our production machines because of #4255... |
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days. |
FWIW, none of the nodes that I have upgraded to 1.8.10 have crashed so far. So it seems that this problem has indeed been fixed (for my use case). |
@LionelCons thanks for the update. Please close this ticket if you think it is safe to do so. |
OK, closing then... |
Well, I have upgraded to 1.8.11 and I now see the problem again. So either the problem was not fixed or there is a new problem that looks very similar. Here is what a 1.8.11 stack trace looks like:
|
@LionelCons since the original issue occurred in a Lua filter, this one looks different, e.g.:
do you have the full config and log file associated ? |
FWIW, downgrading to 1.8.10 makes the problem go away so maybe the recent changes should be looked at. Regarding the configuration files, they cannot be posted here but I can send them to you by email. |
Note: #4164 (comment) is similar to #3412. (That issue is closed but not fixed.) |
Looks like the same or a similar issue here, 1.8.11....
|
Looking at I can't help but wonder whether some TLS is being copied incorrectly or clobbered. |
FWIW, I have tried with 1.8.12 and the problem is still present... |
I am able to reproduce the issue in 1.9.0 as well. |
@bharathiram Could you please describe how you can reproduce the problem? |
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the |
This issue was closed because it has been stalled for 5 days with no activity. |
I am also facing a similar issue. My fluent-bit version is 1.8.13. I get this error when I try to terminate the fluent-bit container with the 'kill' command in a bash script. |
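Several reports in this thread (Ctrl-C on Windows, `kill` from a script) point at the shutdown path, so it can help to tell a clean exit apart from a crash during termination. The following is only a minimal sketch under assumptions, not anything from Fluent Bit itself: `stop_gracefully` and `describe_exit` are hypothetical helper names, the 5-second grace period is arbitrary, and the negative return code convention applies to POSIX systems only. The demo uses a sleeping Python child as a stand-in for the real process.

```python
import signal
import subprocess
import sys


def stop_gracefully(proc: subprocess.Popen, grace_seconds: float = 5.0) -> int:
    """Send SIGTERM, wait up to grace_seconds, then fall back to SIGKILL."""
    proc.send_signal(signal.SIGTERM)
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        # The process ignored or hung on SIGTERM; force it down.
        proc.kill()
        return proc.wait()


def describe_exit(returncode: int) -> str:
    """Translate a Popen return code into a readable description (POSIX)."""
    if returncode < 0:
        # On POSIX, Popen reports death-by-signal as the negated signal number.
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"


if __name__ == "__main__":
    # Stand-in child process; replace the command with the one under test.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print(describe_exit(stop_gracefully(child)))
```

If the process under test handles SIGTERM and shuts down cleanly, the description reports a normal exit status; if it crashes while shutting down, as described above, it would instead report `killed by SIGSEGV`.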
Hi, does anyone know if this was fixed in some fluent-bit version? We are facing this issue with 1.8.11. |
@dosten Don't know if the issue was fixed, but we moved to v2.0.3 and haven't faced this issue. |
After having upgraded my machines to Fluent Bit 1.8.7, I see frequent crashes with SIGSEGV.
Weirdly, the stack traces vary quite a lot. Here are some examples.