Memory dump using basic example shown in stdin documentation #4544

Closed
q2dg opened this issue Dec 27, 2021 · 15 comments

@q2dg

q2dg commented Dec 27, 2021

Bug Report

Describe the bug

It is shown in this picture:

[Screenshot from 2021-12-27 23-45-33]

Your Environment

I built Fluent Bit from source on a Fedora 35 system, from a git clone of the repository. Running fluent-bit -V reports v1.9.0.
Thanks!

@nokute78
Collaborator

nokute78 commented Jan 4, 2022

I built master and tested it on Fedora 35.
It worked correctly.

[taka@fedora build]$ cat /etc/redhat-release 
Fedora release 35 (Thirty Five)
[taka@fedora build]$ cat test.sh 
#!/bin/bash

while [[ true ]]; do
    echo -n '{"clau" : "un valor"}'
    sleep 1
done
[taka@fedora build]$ ./test.sh | bin/fluent-bit -i stdin -o stdout
Fluent Bit v1.9.0
* Copyright (C) 2019-2021 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2022/01/04 16:18:52] [ info] [engine] started (pid=2517)
[2022/01/04 16:18:52] [ info] [storage] version=1.1.5, initializing...
[2022/01/04 16:18:52] [ info] [storage] in-memory
[2022/01/04 16:18:52] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2022/01/04 16:18:52] [ info] [cmetrics] version=0.2.2
[2022/01/04 16:18:52] [ info] [sp] stream processor started
[0] stdin.0: [1641280732.639314982, {"clau"=>"un valor"}]
[1] stdin.0: [1641280733.593922786, {"clau"=>"un valor"}]
[2] stdin.0: [1641280734.595366447, {"clau"=>"un valor"}]
[3] stdin.0: [1641280735.596630251, {"clau"=>"un valor"}]
[4] stdin.0: [1641280736.598423689, {"clau"=>"un valor"}]
^C[2022/01/04 16:18:57] [engine] caught signal (SIGINT)
[0] stdin.0: [1641280737.600338854, {"clau"=>"un valor"}]
[2022/01/04 16:18:57] [ warn] [engine] service will shutdown in max 5 seconds
[2022/01/04 16:18:57] [ warn] [input:stdin:stdin.0] end of file (stdin closed by remote end)
[2022/01/04 16:18:57] [ warn] [engine] service will shutdown in max 5 seconds
[2022/01/04 16:18:58] [ info] [engine] service has stopped (0 pending tasks)

@q2dg
Author

q2dg commented Jan 4, 2022

Well, I've tried downloading and compiling Fluent Bit's source code again, and I keep getting this error.
This is what I did to build my own copy of Fluent Bit (maybe my mistake is here):

sudo dnf install git cmake flex bison gcc gcc-c++ systemd-devel
git clone https://github.com/fluent/fluent-bit
cd fluent-bit/build
cmake ../
make
sudo make install

Thanks for your interest.

@nokute78
Collaborator

nokute78 commented Jan 8, 2022

@q2dg Thank you for the information.
Your build steps look fine.

I followed the same steps, but I can't reproduce your issue...

@ptsneves
Contributor

ptsneves commented Feb 2, 2022

@q2dg can you send your coredump?

@q2dg
Author

q2dg commented Feb 2, 2022

Yes, of course!
Here are the coredump messages:

[Image: coredump-messages]

I've uploaded the coredump file here: https://file.io/MCs38yXCH8Hw
Thanks!

@ptsneves
Contributor

ptsneves commented Feb 3, 2022

The file is no longer available. Could you re-upload it with a longer expiration time?

It seems that the crash is not in a specific parser but probably happens while dereferencing parser->type: the last non-signal-handler code that ran is flb_parser_do, which does nothing that should cause a SIGSEGV apart from the parser-> dereference. Either that, or the crash happened in another thread not visible in the output you sent :)

Can you compile with -fsanitize=address?

@q2dg
Author

q2dg commented Feb 3, 2022

Well, I don't know whether I did this correctly (I don't know how CMake works). I just added these lines to the main CMakeLists.txt file:

set(CFLAGS -fsanitize=address)
set(CXXFLAGS -fsanitize=address)
set(LDFLAGS -fsanitize=address)

Anyway, the result is the same.

I reuploaded my coredump file here: https://www.mediafire.com/file/v8ys6ym1wbgyo1k/coredump.zst/file

Thanks

@ptsneves
Contributor

ptsneves commented Feb 7, 2022

Hey, sorry for the delay and the back-and-forth, but I am not used to helping on machines I do not control, and I forgot some extra details:

Can you provide me with the unstripped binary you used to get the coredump? Make sure the binary is not stripped. If you build with -DFLB_DEV=On you should have a proper unstripped binary.

Also, given that we cannot reproduce it, I think this is some issue triggered by your environment. Have you tried changing your locale to, for example, C.UTF-8 and running the reproducer again?

@q2dg
Author

q2dg commented Feb 7, 2022

Whether I run export LANG=C.UTF-8 before starting Fluent Bit or keep my standard environment, it still crashes.
However, the error message is now different:

[Screenshot from 2022-02-07 12-52-48]

My fluent-bit binary is not stripped:

[Screenshot from 2022-02-07 13-02-41]

Thanks a lot and sorry for the inconvenience.

@ptsneves
Contributor

ptsneves commented Feb 7, 2022

Changing the locale led to a very strange outcome, but not a helpful one.
OK, so you need to upload your fluent-bit binary as well. The coredump and the binary are two different things. You uploaded the coredump, but I also need the unstripped fluent-bit binary.

No inconvenience. Just doing our best :)

@q2dg
Author

q2dg commented Feb 7, 2022

Here it is: the binary file! https://www.mediafire.com/file/ykprv1vnc869qye/fluent-bit/file
Thanks!

ptsneves added a commit to ptsneves/fluent-bit that referenced this issue Feb 7, 2022
in_stdin_collect tests !ctx->parser to decide whether a parser
is associated with the context or not.

The problem with that check is that ctx->parser is not explicitly initialized
in in_stdin_init, and the malloc allocation does not guarantee that the
memory assigned to ctx, and thus ctx->parser, is zero-initialized. This
leads to undefined behavior where ctx->parser is sometimes not 0 and a
non-existent parser is used. Errors like fluent#4544 will then pop up
randomly.

This fix was validated with valgrind and the example provided in fluent#4544
@ptsneves
Contributor

ptsneves commented Feb 7, 2022

From the gdb session with your coredump and exe:

Reading symbols from /home/pneves/Downloads/fluent-bit...

warning: exec file is newer than core file.
[New LWP 10094]
[New LWP 10092]
[New LWP 10095]
Core was generated by `fluent-bit -i stdin -o stdout'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007fe9e2b8d84c in ?? ()
[Current thread is 1 (LWP 10094)]
(gdb) info threads
  Id   Target Id         Frame 
* 1    LWP 10094         0x00007fe9e2b8d84c in ?? ()
  2    LWP 10092         0x00007fe9e2bd63a5 in ?? ()
  3    LWP 10095         0x00007fe9e2b88907 in ?? ()
(gdb) thread apply all bt

Thread 3 (LWP 10095):
#0  0x00007fe9e2b88907 in ?? ()
#1  0x00007fe9e2b90925 in ?? ()
#2  0x00007fe9dc004858 in ?? ()
#3  0x00000000e32a3200 in ?? ()
#4  0x0000000000000072 in ?? ()
#5  0x0000000000447a1a in flb_log_create (config=0x0, type=0, level=0, out=0x7fe9dc0047c0 "\n") at /home/usuari/fluent-bit/src/flb_log.c:225
#6  0x000000000046dba6 in flb_worker_context_create (func=0x7fe9dc005110, arg=0x0, config=0x46dba6 <flb_worker_context_create+124>) at /home/usuari/fluent-bit/src/flb_worker.c:62
#7  0x00007fe9e2b8ba87 in ?? ()
#8  0x0000000000000000 in ?? ()

Thread 2 (LWP 10092):
#0  0x00007fe9e2bd63a5 in ?? ()
#1  0x0000000000000000 in ?? ()

Thread 1 (LWP 10094):
#0  0x00007fe9e2b8d84c in ?? ()
#1  0x0000000000000000 in ?? ()

As you can see, flb_log_create is passed 0x0 for config, type, and level. A 0x0 config pointer will immediately lead to a crash and, from what I see in the caller of flb_log_create, should be impossible. My conclusion is that some form of memory corruption is causing your crash, and the place where it crashes is meaningless.

Can you run ./test.sh | valgrind --tool=memcheck --trace-children=yes --track-origins=yes fluent-bit [...]

When I run this, I get:

==1028853== Thread 2 flb-pipeline:
==1028853== Conditional jump or move depends on uninitialised value(s)
==1028853==    at 0x20E91D: in_stdin_collect (in_stdin.c:130)
==1028853==    by 0x17B322: flb_input_collector_fd (flb_input.c:1101)
==1028853==    by 0x19041B: flb_engine_handle_event (flb_engine.c:412)
==1028853==    by 0x19041B: flb_engine_start (flb_engine.c:704)
==1028853==    by 0x16E4B9: flb_lib_worker (flb_lib.c:626)
==1028853==    by 0x487C608: start_thread (pthread_create.c:477)
==1028853==    by 0x4E93292: clone (clone.S:95)
==1028853==  Uninitialised value was created by a heap allocation
==1028853==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==1028853==    by 0x20E1ED: flb_malloc (flb_mem.h:55)
==1028853==    by 0x20F0A8: in_stdin_init (in_stdin.c:278)
==1028853==    by 0x179F6B: flb_input_instance_init (flb_input.c:562)
==1028853==    by 0x17A060: flb_input_init_all (flb_input.c:598)
==1028853==    by 0x18FF23: flb_engine_start (flb_engine.c:595)
==1028853==    by 0x16E4B9: flb_lib_worker (flb_lib.c:626)
==1028853==    by 0x487C608: start_thread (pthread_create.c:477)
==1028853==    by 0x4E93292: clone (clone.S:95)
==1028853== 

This looks like the path that led to your crash. It is clearly undefined behavior: malloc does not initialize the memory to 0, yet the code relies on it being 0.
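The pattern that valgrind points at can be sketched in isolation. This is a minimal illustration, not the actual Fluent Bit code or patch: the names demo_parser, demo_stdin_ctx, demo_ctx_create and demo_ctx_has_parser are hypothetical stand-ins, and the sketch assumes the remedy is to zero the context at allocation and assign the pointer explicitly, so that a `!ctx->parser` style check reads a defined value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for illustration; not the real Fluent Bit types. */
struct demo_parser {
    int type;
};

struct demo_stdin_ctx {
    int fd;
    struct demo_parser *parser;   /* NULL means "no parser configured" */
};

/*
 * Allocate with calloc so the whole struct starts zeroed, then still assign
 * the pointer explicitly. The explicit assignment documents the intent and
 * does not depend on how the allocator filled the memory.
 */
struct demo_stdin_ctx *demo_ctx_create(int fd)
{
    struct demo_stdin_ctx *ctx = calloc(1, sizeof(*ctx));
    if (ctx == NULL) {
        return NULL;
    }
    ctx->fd = fd;
    ctx->parser = NULL;
    return ctx;
}

/* Same idea as a `!ctx->parser` check: only meaningful once parser is initialized. */
int demo_ctx_has_parser(const struct demo_stdin_ctx *ctx)
{
    assert(ctx != NULL);
    return ctx->parser != NULL;
}
```

With a plain malloc and no assignment, ctx->parser would hold an indeterminate value, so the check could go either way from one run or machine to the next. That would be consistent with the crash not reproducing for everyone.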

I made a pull request at https://github.com/ptsneves/fluent-bit/pull/new/issue-4544
Can you build that code and tell me if this fixes your problem?

ptsneves added a commit to ptsneves/fluent-bit that referenced this issue Feb 7, 2022
Signed-off-by: Paulo Neves <[email protected]>
@q2dg
Author

q2dg commented Feb 7, 2022

YES. IT WORKS!!
If you don't mind, I'll close the issue, then.
Thanks a lot!!

q2dg closed this as completed Feb 7, 2022
@ptsneves
Contributor

ptsneves commented Feb 7, 2022

Glad it works. Keep in mind that this has not been merged yet, and I do not know what the timeframe is or whether it will be merged at all :)

edsiper pushed a commit that referenced this issue Feb 8, 2022
Signed-off-by: Paulo Neves <[email protected]>
@ptsneves
Contributor

ptsneves commented Feb 9, 2022

merged

patrick-stephens pushed a commit that referenced this issue Feb 9, 2022
Signed-off-by: Paulo Neves <[email protected]>
Signed-off-by: Patrick Stephens <[email protected]>