config: reload: bin: Add opt-out option to ensure thread safety on hot reloading #7509

Merged
merged 2 commits into from
Jun 13, 2023

cosmo0920
Contributor

@cosmo0920 cosmo0920 commented Jun 1, 2023

When tasks are still pending at shutdown, fluent-bit wrongly goes on to process those leftover tasks even though they are in an invalid state.
This causes the following SEGV:

[2023/06/01 16:49:04] [error] % Failed to produce to topic test: Local: Queue full

[2023/06/01 16:49:04] [ warn] [output:kafka:kafka.0] internal queue is full, retrying in one second
[2023/06/01 16:49:04] [ info] [task] tail/tail.0 has 2 pending task(s):
[2023/06/01 16:49:04] [ info] [task]   task_id=2 still running on route(s): kafka/kafka.0 
[2023/06/01 16:49:04] [ info] [task]   task_id=3 still running on route(s): kafka/kafka.0 
[2023/06/01 16:49:04] [ info] [task] storage_backlog/storage_backlog.1 has 0 pending task(s):
[2023/06/01 16:49:04] [ info] [engine] service has stopped (2 pending tasks)
[2023/06/01 16:49:04] [ info] [input] pausing tail.0
[2023/06/01 16:49:04] [ info] [input] pausing storage_backlog.1
[2023/06/01 16:49:04] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=42739117 watch_fd=1
[2023/06/01 16:49:04] [ info] [output:kafka:kafka.0] thread worker #0 stopping...
[2023/06/01 16:49:05] [engine] caught signal (SIGSEGV)
#0  0x5563ae057ff5      in  template_execute() at lib/msgpack-c/include/msgpack/unpack_template.h:172
#1  0x5563ae059b4a      in  msgpack_unpack_next() at lib/msgpack-c/src/unpack.c:677
#2  0x5563ad57352f      in  flb_log_event_decoder_next() at src/flb_log_event_decoder.c:285
#3  0x5563ad866978      in  cb_kafka_flush() at plugins/out_kafka/kafka.c:496
#4  0x5563ad47ad3d      in  output_pre_cb_flush() at include/fluent-bit/flb_output.h:559
#5  0x5563ae07760a      in  co_init() at lib/monkey/deps/flb_libco/amd64.c:117
#6  0xffffffffffffffff  in  ???() at ???:0

This issue can be triggered by a high-volume tail input combined with a kafka output:

[SERVICE]
   flush           1
   log_level       info
   Parsers_File    parsers.conf
   Grace           5
   http_server on
   http_listen 0.0.0.0
   http_port 2020
   health_check on
   storage.backlog.mem_limit 512M
   storage.checksum off
   storage.sync normal
   storage.max_chunks_up 256
   storage.metrics on
   storage.path ./flb-storage/
   total_limit_size 10M

[INPUT]
   Name              tail
   Path              repro.log
   Read_from_Head    True
   buffer_max_size   4M
   buffer_chunk_size 128K
   Refresh_Interval  60
   storage.type filesystem
   mem_buf_limit 50MB

[OUTPUT]
   Name kafka
   Match *
   topics test
   brokers 127.0.0.1:9092
   Workers 1
   format json
   timestamp_format iso8601_ns
   rdkafka.message.max.bytes 10000000
   rdkafka.request.required.acks 1
   rdkafka.log.connection.close false
   storage.total_limit_size 500M

The high-volume log is generated by:

while true
do
    cat <<__EOF__ >>repro.log
message: ok, seq: 0
message: ok, seq: 1
message: ok, seq: 2
message: ok, seq: 3
message: ok, seq: 4
message: ok, seq: 5
message: ok, seq: 6
message: ok, seq: 7
message: ok, seq: 8
message: ok, seq: 9
message: ok, seq: 10
__EOF__
    echo -n .
    sleep 0.1
done

To opt out of ensuring thread safety when a fluent-bit context is terminated, disable the opt-out parameter with:

Hot_Reload.Ensure_Thread_Safety Off

Alternatively, the -W/--disable-thread-safety-on-hot-reload option disables setting grace to -1 before the old fluent-bit context is stopped.
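
As a minimal sketch of the opt-out in a classic-mode configuration file: this assumes the parameter is read from the [SERVICE] section alongside the other service-level keys in the reproduction config, and that hot reloading itself is switched on with Hot_Reload On. Both are assumptions and are not stated in this description.

```
[SERVICE]
   # Assumption: hot reloading is enabled for this instance
   Hot_Reload                      On
   # Opt out of the thread-safety guarantee on hot reload
   Hot_Reload.Ensure_Thread_Safety Off
```

Leaving the second key unset keeps the safer behavior, since this is an opt-out rather than an opt-in option.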


Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change; please submit the following in a comment:

  • Example configuration file for the change
  • Debug log output from testing the change
  • Attached Valgrind output that shows no leaks or memory corruption was found

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • Run local packaging test showing all targets (including any new ones) build.
  • Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

  • Documentation required for this feature

Backporting

  • Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

@cosmo0920 cosmo0920 force-pushed the cosmo0920-ensure-thread-safety-on-hot-reloading branch from a8499a5 to 43547c7 Compare June 1, 2023 09:04
@cosmo0920 cosmo0920 changed the title from "config: reload: bin: Add opt-in option to ensure thread safety on hot reloading" to "config: reload: bin: Add opt-out option to ensure thread safety on hot reloading" Jun 1, 2023
@edsiper
Member

edsiper commented Jun 13, 2023

@cosmo0920, why is this a configurable option and not the default behavior?

3 participants