Memory dump using basic example shown in stdin documentation #4544

q2dg · 2021-12-27T22:50:52Z

Bug Report

Describe the bug

It is shown in this picture:

Your Environment

I installed Fluent Bit from source on a Fedora 35 system through git clone... (fluent-bit -V reports v1.9.0)
Thanks!

nokute78 · 2022-01-04T07:20:59Z

I built master and tested it on Fedora 35.
It worked correctly.

[taka@fedora build]$ cat /etc/redhat-release 
Fedora release 35 (Thirty Five)
[taka@fedora build]$ cat test.sh 
#!/bin/bash

while [[ true ]]; do
echo -n '{"clau" : "un valor"}'
sleep 1
done
[taka@fedora build]$ ./test.sh | bin/fluent-bit -i stdin -o stdout
Fluent Bit v1.9.0
* Copyright (C) 2019-2021 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2022/01/04 16:18:52] [ info] [engine] started (pid=2517)
[2022/01/04 16:18:52] [ info] [storage] version=1.1.5, initializing...
[2022/01/04 16:18:52] [ info] [storage] in-memory
[2022/01/04 16:18:52] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2022/01/04 16:18:52] [ info] [cmetrics] version=0.2.2
[2022/01/04 16:18:52] [ info] [sp] stream processor started
[0] stdin.0: [1641280732.639314982, {"clau"=>"un valor"}]
[1] stdin.0: [1641280733.593922786, {"clau"=>"un valor"}]
[2] stdin.0: [1641280734.595366447, {"clau"=>"un valor"}]
[3] stdin.0: [1641280735.596630251, {"clau"=>"un valor"}]
[4] stdin.0: [1641280736.598423689, {"clau"=>"un valor"}]
^C[2022/01/04 16:18:57] [engine] caught signal (SIGINT)
[0] stdin.0: [1641280737.600338854, {"clau"=>"un valor"}]
[2022/01/04 16:18:57] [ warn] [engine] service will shutdown in max 5 seconds
[2022/01/04 16:18:57] [ warn] [input:stdin:stdin.0] end of file (stdin closed by remote end)
[2022/01/04 16:18:57] [ warn] [engine] service will shutdown in max 5 seconds
[2022/01/04 16:18:58] [ info] [engine] service has stopped (0 pending tasks)

q2dg · 2022-01-04T09:35:02Z

Well, I've tried downloading and compiling Fluent Bit's source code again and I keep getting this error.
This is what I did to build my own copy of Fluent Bit (maybe my mistake is here):

sudo dnf install git cmake flex bison gcc gcc-c++ systemd-devel
git clone https://github.com/fluent/fluent-bit
cd fluent-bit/build
cmake ../
make
sudo make install

Thanks for your interest.

nokute78 · 2022-01-08T02:40:53Z

@q2dg Thank you for the information.
Your build steps look fine.

I followed the same steps, but I can't reproduce your issue...

ptsneves · 2022-02-02T14:16:25Z

@q2dg can you send your coredump?

q2dg · 2022-02-02T16:15:17Z

Yes, of course!
Here are the coredump messages:

I've uploaded the coredump file here. https://file.io/MCs38yXCH8Hw
Thanks!

ptsneves · 2022-02-03T10:33:42Z

The file is no longer available. Could you re-upload it with a longer expiration time?

It seems that it does not crash in a specific parser but probably while dereferencing parser->type: the last non-signal-handler code that ran is flb_parser_do, which does nothing that should cause a SIGSEGV besides the parser-> dereference. Either that, or the crash happened in another thread that is not visible in the output you sent :)

Can you compile with -fsanitize=address?
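
To illustrate the reasoning above, here is a minimal sketch of a dispatcher that reads parser->type before doing anything else. All names (demo_parser, demo_parser_do, demo_collect) are hypothetical stand-ins, not the real Fluent Bit types or the actual flb_parser_do. It only shows why an invalid but non-NULL pointer would fault at the dereference rather than inside any format-specific parser.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are not the real Fluent Bit types. */
enum demo_parser_type {
    DEMO_PARSER_JSON = 1,
    DEMO_PARSER_REGEX = 2
};

struct demo_parser {
    enum demo_parser_type type;
};

/* Dispatches on parser->type. Reading through `parser` is the first real
 * work done here, so an invalid pointer faults at the switch, before any
 * format-specific branch runs. */
int demo_parser_do(const struct demo_parser *parser, const char *buf, size_t len)
{
    assert(buf != NULL);
    switch (parser->type) {
    case DEMO_PARSER_JSON:
        /* Toy check: accept anything that starts like a JSON object. */
        return (len > 0 && buf[0] == '{') ? 0 : -1;
    case DEMO_PARSER_REGEX:
        /* Toy check: accept any non-empty input. */
        return (len > 0) ? 0 : -1;
    default:
        return -1;
    }
}

/* Caller-side guard. It is only sound if `parser` is either NULL or a
 * valid object: a garbage non-NULL pointer passes this check and still
 * crashes inside demo_parser_do(). */
int demo_collect(const struct demo_parser *parser, const char *buf, size_t len)
{
    if (!parser) {
        return 1;   /* no parser configured: take the default path */
    }
    return demo_parser_do(parser, buf, len);
}
```

Calling demo_collect(NULL, "{}", 2) takes the default path, while passing a properly initialized demo_parser reaches the dispatcher. The guard cannot tell a valid pointer from leftover heap contents, which is why the place of the crash says little about the root cause.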

q2dg · 2022-02-03T15:11:11Z

Well, I don't know whether I've done this correctly (I don't know how CMake works)... I've just added these lines to the main CMakeLists.txt file:

set(CFLAGS -fsanitize=address)
set(CXXFLAGS -fsanitize=address)
set(LDFLAGS -fsanitize=address)

Anyway, the result is the same.

I reuploaded my coredump file here: https://www.mediafire.com/file/v8ys6ym1wbgyo1k/coredump.zst/file

Thanks

ptsneves · 2022-02-07T11:41:07Z

Hey, sorry for the delay and the back-and-forth, but I am not used to helping on machines I do not control, and I forgot some extra details:

Can you provide me with the unstripped binary you used to get the coredump? Make sure the binary is not stripped. If you build with -DFLB_DEV=On you should have a proper unstripped binary.

Also, given that we cannot reproduce it, I think this is some issue triggered by your environment. Have you tried changing your locale to, for example, C.UTF-8 and running the reproducer again?

q2dg · 2022-02-07T12:03:16Z

Whether I run export LANG=C.UTF-8 before starting Fluent Bit or keep my standard environment, it still crashes.
However, the error message is now different:

My fluent-bit binary is not stripped...:

Thanks a lot and sorry for the inconvenience.

ptsneves · 2022-02-07T12:06:54Z

Changing the locale led to a very strange outcome, but not a helpful one.
OK, so you need to upload your fluent-bit binary as well. The coredump and the binary are two different things. You uploaded the coredump, but I also need the unstripped fluent-bit binary.

No inconvenience. Just doing our best :)

q2dg · 2022-02-07T13:14:30Z

Here it is: the binary file! https://www.mediafire.com/file/ykprv1vnc869qye/fluent-bit/file
Thanks!

in_stdin_collect tests !ctx->parser to decide whether a parser is associated with the context. The problem with that check is that ctx->parser is not explicitly initialized in in_stdin_init, and the malloc allocation does not guarantee that the memory assigned to ctx, and therefore ctx->parser, is zero-initialized. This leads to undefined behavior: sometimes ctx->parser is not 0 and a non-existent parser is used. Errors like fluent#4544 then pop up randomly. This fix was validated with valgrind and the example provided in fluent#4544.

ptsneves · 2022-02-07T18:21:36Z

From the gdb session with your coredump and exe:

Reading symbols from /home/pneves/Downloads/fluent-bit...

warning: exec file is newer than core file.
[New LWP 10094]
[New LWP 10092]
[New LWP 10095]
Core was generated by `fluent-bit -i stdin -o stdout'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007fe9e2b8d84c in ?? ()
[Current thread is 1 (LWP 10094)]
(gdb) info threads
  Id   Target Id         Frame 
* 1    LWP 10094         0x00007fe9e2b8d84c in ?? ()
  2    LWP 10092         0x00007fe9e2bd63a5 in ?? ()
  3    LWP 10095         0x00007fe9e2b88907 in ?? ()
(gdb) thread apply all bt

Thread 3 (LWP 10095):
#0  0x00007fe9e2b88907 in ?? ()
#1  0x00007fe9e2b90925 in ?? ()
#2  0x00007fe9dc004858 in ?? ()
#3  0x00000000e32a3200 in ?? ()
#4  0x0000000000000072 in ?? ()
#5  0x0000000000447a1a in flb_log_create (config=0x0, type=0, level=0, out=0x7fe9dc0047c0 "\n") at /home/usuari/fluent-bit/src/flb_log.c:225
#6  0x000000000046dba6 in flb_worker_context_create (func=0x7fe9dc005110, arg=0x0, config=0x46dba6 <flb_worker_context_create+124>) at /home/usuari/fluent-bit/src/flb_worker.c:62
#7  0x00007fe9e2b8ba87 in ?? ()
#8  0x0000000000000000 in ?? ()

Thread 2 (LWP 10092):
#0  0x00007fe9e2bd63a5 in ?? ()
#1  0x0000000000000000 in ?? ()

Thread 1 (LWP 10094):
#0  0x00007fe9e2b8d84c in ?? ()
#1  0x0000000000000000 in ?? ()

As you can see, flb_log_create is passed 0x0 for config, type and level. A 0x0 config pointer will immediately lead to a crash, and from what I see in the caller, such a call to flb_log_create should be impossible. My conclusion is that some form of memory corruption is leading to your crash, and that the place where it crashes is meaningless.

Can you run ./test.sh | valgrind --tool=memcheck --trace-children=yes --track-origins=yes fluent-bit [...] ?

When I run this, I get:

==1028853== Thread 2 flb-pipeline:
==1028853== Conditional jump or move depends on uninitialised value(s)
==1028853==    at 0x20E91D: in_stdin_collect (in_stdin.c:130)
==1028853==    by 0x17B322: flb_input_collector_fd (flb_input.c:1101)
==1028853==    by 0x19041B: flb_engine_handle_event (flb_engine.c:412)
==1028853==    by 0x19041B: flb_engine_start (flb_engine.c:704)
==1028853==    by 0x16E4B9: flb_lib_worker (flb_lib.c:626)
==1028853==    by 0x487C608: start_thread (pthread_create.c:477)
==1028853==    by 0x4E93292: clone (clone.S:95)
==1028853==  Uninitialised value was created by a heap allocation
==1028853==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==1028853==    by 0x20E1ED: flb_malloc (flb_mem.h:55)
==1028853==    by 0x20F0A8: in_stdin_init (in_stdin.c:278)
==1028853==    by 0x179F6B: flb_input_instance_init (flb_input.c:562)
==1028853==    by 0x17A060: flb_input_init_all (flb_input.c:598)
==1028853==    by 0x18FF23: flb_engine_start (flb_engine.c:595)
==1028853==    by 0x16E4B9: flb_lib_worker (flb_lib.c:626)
==1028853==    by 0x487C608: start_thread (pthread_create.c:477)
==1028853==    by 0x4E93292: clone (clone.S:95)
==1028853==

This looks like the path that led to your crash. It is clearly undefined behavior: malloc does not initialize the memory to 0, yet the code relies on it being 0.
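
The pattern described above can be sketched as follows. This is a minimal illustration under stated assumptions: the struct and function names (sketch_stdin_ctx, sketch_ctx_create_fragile, sketch_ctx_create_zeroed, sketch_ctx_has_parser) are hypothetical, and this is neither the actual in_stdin.c code nor necessarily the exact change in the pull request. It shows one way to make sure the parser field starts out as NULL.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins; these are not the real Fluent Bit structs. */
struct sketch_parser {
    int type;
};

struct sketch_stdin_ctx {
    int fd;
    struct sketch_parser *parser;   /* NULL means "no parser configured" */
};

/* Fragile variant: malloc() leaves the bytes indeterminate, so
 * ctx->parser holds whatever was in that heap block before. Reading it
 * later (for example with `!ctx->parser`) is undefined behavior. */
struct sketch_stdin_ctx *sketch_ctx_create_fragile(int fd)
{
    struct sketch_stdin_ctx *ctx = malloc(sizeof(*ctx));
    if (!ctx) {
        return NULL;
    }
    ctx->fd = fd;
    /* ctx->parser is deliberately left unset to show the problem. */
    return ctx;
}

/* Safer variant: calloc() zero-fills the block, and the pointer is also
 * set to NULL explicitly so the code does not depend on all-bits-zero
 * being the null pointer representation. */
struct sketch_stdin_ctx *sketch_ctx_create_zeroed(int fd)
{
    struct sketch_stdin_ctx *ctx = calloc(1, sizeof(*ctx));
    if (!ctx) {
        return NULL;
    }
    ctx->fd = fd;
    ctx->parser = NULL;
    return ctx;
}

/* Mirrors the `!ctx->parser` test: only meaningful once the field has
 * been initialized. */
int sketch_ctx_has_parser(const struct sketch_stdin_ctx *ctx)
{
    assert(ctx != NULL);
    return ctx->parser != NULL;
}
```

With the zeroed variant, sketch_ctx_has_parser() reliably returns 0 until a parser is assigned. With the fragile variant the result depends on leftover heap contents, which would explain why the crash shows up on some machines and not on others.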

I made a pull request at https://github.com/ptsneves/fluent-bit/pull/new/issue-4544
Can you build that code and tell me if this fixes your problem?

q2dg · 2022-02-07T18:41:02Z

YES. IT WORKS!!
If you don't mind, I'll close the issue, then.
Thanks a lot!!

ptsneves · 2022-02-07T19:25:35Z

Glad it works. Keep in mind this has not been merged yet, and I do not know what the timeframe is or whether it will be merged at all :)

ptsneves · 2022-02-09T08:20:35Z

merged

in_stdin_collect tests !ctx->parser to decide whether a parser is associated with the context. The problem with that check is that ctx->parser is not explicitly initialized in in_stdin_init, and the malloc allocation does not guarantee that the memory assigned to ctx, and therefore ctx->parser, is zero-initialized. This leads to undefined behavior: sometimes ctx->parser is not 0 and a non-existent parser is used. Errors like #4544 then pop up randomly. This fix was validated with valgrind and the example provided in #4544.
Signed-off-by: Paulo Neves <[email protected]>
Signed-off-by: Patrick Stephens <[email protected]>

ptsneves mentioned this issue Feb 7, 2022

in stdin: Initialize memory to 0 #4761

Merged

q2dg closed this as completed Feb 7, 2022
