SIGSEGV on 1.9.4 with stack trace #5753
Report from my customer: testing with fluent/fluent-bit:1.9.6-debug also reproduced the bug.
The following content is from testing with fluent/fluent-bit:1.9.6:
Can you upload the coredump? I would like to see where exactly the 0x0 is passed [as an argument]; that can hint at the issue. Also, please edit the traces so they are in plain text rather than JSON :)
@ptsneves We have not been able to get a core dump yet, will post it if we get one.
I did get a core dump, but it's huge, half a GB, and when I try to read it, I don't get much:
@ptsneves Do you happen to know if there's a likely reason for this? Is it just because I need to install the separate debug info for each of these packages?
Now that I installed all of the debug infos, I get:
Which is still not something I know what to do with...
@PettitWesley You must not have installed the fluent-bit debug symbols, as the only symbols are gettextlex. I believe if you provide them we can get closer. It is a clear null pointer dereference.
Alrighty so we installed all the debuginfos and then got it to pick up the raw fluent bit code and dumped the stack for all threads. Unfortunately I am still struggling to understand what's going on here:
and here is the other core dump:
@ptsneves Do you have any idea about this issue? We ran into the same issue.
@JoeShi no, I did not have time yet. Sorry
@PettitWesley is the SEGFAULT reproducible if you set
@tarruda in 1.9.4, S3 already had 1 worker enabled by default, as does Kinesis. So this was reproduced with 1 worker.
@PettitWesley do you think you could create a simple fluent-bit config that reproduces the segfault? It would help me a lot in debugging this |
@tarruda the only config I have repro'd it with is basically what is shown in the issue. Since I don't know what causes it, I didn't try removing anything from the config.
This bit from another stack trace is probably the key bit:
So it seems to always be failing here: https://github.com/fluent/fluent-bit/blob/v1.9.4/src/flb_input_chunk.c#L1123 This strongly suggests that the flb_input_chunk pointer is pointing to an address that was freed. So it's like the input chunk has already been used and freed by the output, but the input still thinks it's valid and is trying to append new data to it.
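For illustration, here is a minimal sketch of that hypothesized use-after-free pattern; the struct and function names are hypothetical stand-ins, not the actual Fluent Bit code.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of the suspected failure mode, not Fluent Bit code:
 * the output side destroys a chunk while the input side still holds a
 * pointer to it and later appends data to freed memory. */
struct chunk {
    char  *buf;
    size_t len;
};

static void output_flush_and_destroy(struct chunk *c)
{
    /* Output side: the chunk has been flushed, so it is destroyed. */
    free(c->buf);
    free(c);
}

static void input_append(struct chunk *c, const char *data, size_t len)
{
    /* Input side: still believes the chunk is valid. */
    memcpy(c->buf + c->len, data, len);  /* use-after-free if c was destroyed */
    c->len += len;
}

int main(void)
{
    struct chunk *c = calloc(1, sizeof(*c));
    c->buf = malloc(4096);

    output_flush_and_destroy(c);         /* chunk freed by the output path */
    input_append(c, "log line", 8);      /* input appends anyway: SIGSEGV / undefined behavior */
    return 0;
}
```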
We concluded our root cause analysis of the segfault issue.

When does the problem happen
The problem arises when Fluent Bit is used with tags of 256 characters or more and durable filesystem log buffering is enabled.

The problem
Fluent Bit backs log data and tag information to files on disk using a library called Chunk I/O. Chunk I/O fails to properly store the tag metadata length in the file because of a single line of code in cio_file_st_set_meta_len(), the helper that is supposed to store the tag's length.
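A minimal sketch of the bug, assuming a simplified two-byte metadata layout; the helper name comes from the analysis above, but everything else here is illustrative rather than the verbatim chunkio source.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative stand-in for chunkio's cio_file_st_set_meta_len():
 * store a 16-bit metadata (tag) length as two bytes. */
static void set_meta_len_buggy(uint8_t *meta, uint16_t len)
{
    /* BUG: the cast binds tighter than the shift, so len is truncated to
     * 8 bits before the shift and the high byte written here is always 0. */
    meta[0] = (uint8_t) len >> 8;
    meta[1] = (uint8_t) len;
}

int main(void)
{
    uint8_t meta[2];

    set_meta_len_buggy(meta, 300);                /* a tag longer than 255 bytes */
    printf("%d\n", (meta[0] << 8) | meta[1]);     /* prints 44, i.e. 300 mod 256 */
    return 0;
}
```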
This line only fails when len is greater than 255. Due to improper type casting, the expression (uint8_t) len >> 8 always evaluates to 0, because len is cast to an 8-bit integer before being shifted right by 8 bits.

Impact
Upon retrieving the chunk, the tag length read back is the true tag_len modulo 256. This becomes impactful with the following workflow.
--
Upon using this destroyed chunk, retrieved from the cache, a segfault occurs.

Solution
The faulty assignment can be changed so that the shift happens before the cast, as in the sketch below.
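A sketch of the corrected helper, using the same illustrative layout as the snippet above rather than the exact chunkio source:

```c
#include <stdint.h>

/* Corrected version: shift first, then cast, so the high byte of the
 * 16-bit tag length is preserved in the stored metadata. */
static void set_meta_len_fixed(uint8_t *meta, uint16_t len)
{
    meta[0] = (uint8_t) (len >> 8);
    meta[1] = (uint8_t) (len & 0xff);
}
```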
This allows the full tag length to be stored successfully and the cache entry to be removed properly on chunk deletion, preventing access to a destroyed chunk.

Timeline
After searching through the library code base, we discovered that this problem was separately identified and resolved two months ago in ChunkIO by Leonardo Albertovitch, who made the corresponding fix in the following PR: https://github.com/calyptia/chunkio/pull/86/files#diff-f0d2cad50811f32fa059f311467edde4ae20b686d7eac480307fe30567d7548eR78-R79
Fluent Bit master and 1.9 both do not yet have the updated ChunkIO code and thus still suffer from the segfault.

Action
Ping the Fluent Bit core developers to drive the ChunkIO update.

Workarounds
Ensure the tag length is less than 256 characters.

Testing in progress
Load tests are currently being run via FireLens datajet to validate that the segfault no longer occurs once the above code change is applied.
@matthewfala thanks for the detailed investigation and report; we have passed the information to the chunkio devs.
@lecaros we discussed with Eduardo today. We need this bug fix to be merged into the 1.9.9 release, as it affects many customers.
Please double check with @edsiper. |
The solution, which is the updated chunkio dependency, is merged to master but not yet merged to 1.9. Please see: Looks like we're on track to release this in 2.0, but it would be great to get it back-ported into 1.9.
Hi @lubingfeng and @matthewfala, |
To echo @matthewfala's analysis: based on my coredump, I can indeed see that the tag length is over 256. Great finding!
Check the full log below:
1.9.9 (to be released) has been patched by upgrading to Chunk I/O v1.3.0. Thanks, everyone, for working on this.
Is there any news about the 1.9.9 release?
Is anybody seeing this issue with the 2.0.9 release? I deployed 2.0.9 to OpenShift, and the pods constantly crash. [2023/03/13 21:26:59] [debug] [input:tail:tail.1] [static files] processed 32.0K
Yes, that bug has been fixed in the upcoming version 2.0.10.
While waiting for 2.0.10, should I roll back to an earlier version like 1.9.9? |
Bug Report
Describe the bug
SIGSEGV and crash/pod restart. We got a stack trace from Valgrind.
Configuration
Your Environment
Amazon EKS on Amazon EC2 on Amazon Linux 2
Additional context