Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

out_es: estimate bulk size to less reallocation(#3775) #3788

Merged
merged 1 commit into from
Jul 19, 2021

Conversation

nokute78
Copy link
Collaborator

Fixes #3775
This patch is to reduce realloc in es_bulk_append.

  1. Initial bulk size is a size of messagepack
  2. Estimate appending size from current convert ratio (json->messagepack ) and left messagepack size.

Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change; please submit the following in a comment:

  • Example configuration file for the change
  • [N/A] Debug log output from testing the change
  • Attached Valgrind output that shows no leaks or memory corruption was found

Documentation

  • [N/A] Documentation required for this feature

Example Configuration

[INPUT]
    Name dummy
    Rate 100000

[OUTPUT]
    Name es
    Match *

Valgrind output

$ valgrind --leak-check=full bin/fluent-bit -c 3775/a.conf 
==27669== Memcheck, a memory error detector
==27669== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==27669== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==27669== Command: bin/fluent-bit -c 3775/a.conf
==27669== 
Fluent Bit v1.9.0
* Copyright (C) 2019-2021 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2021/07/18 22:02:55] [ info] [engine] started (pid=27669)
[2021/07/18 22:02:55] [ info] [storage] version=1.1.1, initializing...
[2021/07/18 22:02:55] [ info] [storage] in-memory
[2021/07/18 22:02:55] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2021/07/18 22:02:55] [ info] [cmetrics] version=0.1.5
[2021/07/18 22:02:55] [ info] [sp] stream processor started
^C[2021/07/18 22:02:57] [engine] caught signal (SIGINT)
==27669== Warning: client switching stacks?  SP change: 0x57e4958 --> 0x6618cb0
==27669==          to suppress, use: --max-stackframe=14893912 or greater
==27669== Warning: client switching stacks?  SP change: 0x6618c28 --> 0x57e4958
==27669==          to suppress, use: --max-stackframe=14893776 or greater
==27669== Warning: client switching stacks?  SP change: 0x57e4958 --> 0x6618c28
==27669==          to suppress, use: --max-stackframe=14893776 or greater
==27669==          further instances of this message will not be shown.
[2021/07/18 22:02:57] [ warn] [engine] service will stop in 5 seconds
[2021/07/18 22:02:59] [error] [src/flb_http_client.c:1159 errno=32] Broken pipe
[2021/07/18 22:02:59] [ warn] [output:es:es.0] http_do=-1 URI=/_bulk
[2021/07/18 22:02:59] [ warn] [engine] failed to flush chunk '27669-1626613375.626651597.flb', retry in 11 seconds: task_id=0, input=dummy.0 > output=es.0 (out_id=0)
[2021/07/18 22:03:01] [ info] [engine] service stopped
==27669== Warning: invalid file descriptor -1 in syscall close()
==27669== 
==27669== HEAP SUMMARY:
==27669==     in use at exit: 0 bytes in 0 blocks
==27669==   total heap usage: 257,155 allocs, 257,155 frees, 1,199,009,254 bytes allocated
==27669== 
==27669== All heap blocks were freed -- no leaks are possible
==27669== 
==27669== For lists of detected and suppressed errors, rerun with: -s
==27669== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

@nokute78
Copy link
Collaborator Author

For example,

  1. A size of incoming MessagePack is 1000KB.
  2. out_es allocates 1000KB as a bulk.
  3. out_es converts MessagePack into JSON and writes to bulk over and over.
  4. Bulk is full. In this time, the rest size of MessagePack is 600KB. It means MessagePack size : JSON size = 400 : 1000 = 2 : 5.
  5. Estimated rest size of JSON is 600KB x 5 / 2 = 1500KB.
  6. out_es reallocates 1000KB+1500KB.
  7. out_es converts again and again.

@nokute78 nokute78 force-pushed the estimate_es_bulk_size branch from ae23b2c to f587cd3 Compare July 19, 2021 11:56
@edsiper edsiper merged commit a534119 into fluent:master Jul 19, 2021
@edsiper
Copy link
Member

edsiper commented Jul 19, 2021

thanks!

@nokute78 nokute78 deleted the estimate_es_bulk_size branch July 29, 2021 11:09
nokute78 added a commit to nokute78/fluent-bit that referenced this pull request Aug 9, 2021
We updated re-allocation logic(fluent#3788).
It tries to devide by converted_size.
If converted_size is 0, it causes SIGFPE and exit code 136.

Signed-off-by: Takahiro Yamashita <[email protected]>
nokute78 added a commit to nokute78/fluent-bit that referenced this pull request Aug 9, 2021
We updated re-allocation logic(fluent#3788).
It tries to divide by converted_size.
If converted_size is 0, it causes SIGFPE and exit code 136.

Signed-off-by: Takahiro Yamashita <[email protected]>
nokute78 added a commit to nokute78/fluent-bit that referenced this pull request Aug 9, 2021
We updated re-allocation logic(fluent#3788).
It tries to divide by converted_size.
If converted_size is 0, it causes SIGFPE and exit code 136.

Signed-off-by: Takahiro Yamashita <[email protected]>
edsiper pushed a commit that referenced this pull request Aug 13, 2021
We updated re-allocation logic(#3788).
It tries to divide by converted_size.
If converted_size is 0, it causes SIGFPE and exit code 136.

Signed-off-by: Takahiro Yamashita <[email protected]>
edsiper pushed a commit that referenced this pull request Aug 14, 2021
We updated re-allocation logic(#3788).
It tries to divide by converted_size.
If converted_size is 0, it causes SIGFPE and exit code 136.

Signed-off-by: Takahiro Yamashita <[email protected]>
@GuillaumeSmaha
Copy link
Contributor

GuillaumeSmaha commented Dec 8, 2021

Hi @nokute78 ,
I am having SEGV (#4195 (comment)) and I think it is related to this PR.
In some case, append_size is still inferior to required which means memcpy is writing outside of the buffer and can create a SEGV.

On 1.8.10 version, I try by adding a log to check this case GuillaumeSmaha@5b908dd and I got the log few time with a SEGV just after:

�[1m[�[0m2021/12/08 11:55:05�[1m]�[0m [�[93mdebug�[0m] [out_es] converted_size is 0
�[1m[�[0m2021/12/08 11:55:05�[1m]�[0m [�[93mdebug�[0m] [out_es] required size (197133, append: 4096) is still superior to available buffer (37999, previous available: 33903)

Thread 2 "flb-pipeline" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff61ff700 (LWP 45)]
__memmove_avx_unaligned_erms ()
    at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:262
262	../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: No such file or directory.
(gdb) bt
#0  __memmove_avx_unaligned_erms ()
    at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:262
#1  0x00005555556fc274 in es_bulk_append (bulk=0x7ffff5400d60, 
    index=0x7ffff4ac0a40 "{\"create\":{\"_index\":\"logstash-2021.12.08\",\"_type\":\"_doc\",\"_id\":\"a76df76b-d2c7-9d2b-e172-93e0ec3c3d4f\"}}\n", i_len=104, 
    json=0x7ffff253fad0 "{\"@timestamp\":\"2021-12-08T11:54:07.027Z\",\"time\":\"2021-12-08T11:54:07.027621077Z\",\"stream\":\"stdout\",\"_p\":\"P\",\"log\":\"\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\u0000\\"..., j_len=196967, whole_size=33903, converted_size=0)
    at /tmpdir/fluent-bit/plugins/out_es/es_bulk.c:111
#2  0x00005555556f9af9 in elasticsearch_format (config=0x7ffff6632f80, 
    ins=0x7ffff6648f00, plugin_context=0x7ffff54a9600, flush_ctx=0x0, 
    tag=0x7ffff54a2700 "kube.var.log.containers.toolkit-jobs-job-48bfef83-e663-4c37-b55f-50ea1db9a588_toolkit-jobs_toolkit-jobs-job-48bfef83-e663-4c37-b55f-50ea1db9a588-1-db65750898be03ec7ef665854d8036800d9609adeb19a0cc5b021"..., 
    tag_len=215, data=0x7ffff2341f00, bytes=33903, out_data=0x7ffff4ac0f60, 
    out_size=0x7ffff4ac0f58) at /tmpdir/fluent-bit/plugins/out_es/es.c:537
#3  0x00005555556fa8c5 in cb_es_flush (data=0x7ffff2341f00, bytes=33903, 
    tag=0x7ffff54a2700 "kube.var.log.containers.toolkit-jobs-job-48bfef83-e663-4c37-b55f-50ea1db9a588_toolkit-jobs_toolkit-jobs-job-48bfef83-e663-4c37-b55f-50ea1db9a588-1-db65750898be03ec7ef665854d8036800d9609adeb19a0cc5b021"..., 
    tag_len=215, ins=0x7ffff6648f00, out_context=0x7ffff54a9600, 
    config=0x7ffff6632f80) at /tmpdir/fluent-bit/plugins/out_es/es.c:772
--Type <RET> for more, q to quit, c to continue without paging--
#4  0x000055555566b2df in output_pre_cb_flush ()
    at /tmpdir/fluent-bit/include/fluent-bit/flb_output.h:514
#5  0x0000555555b69d47 in co_init ()
    at /tmpdir/fluent-bit/lib/monkey/deps/flb_libco/amd64.c:117
#6  0x0000000000000000 in ?? ()
(gdb) 

gdb-fluent-bit-4vljw.log


I also try to revert the change by using:

   if (available < required) {
        append_size = available + required + ES_BULK_CHUNK;
        if (append_size < ES_BULK_CHUNK) {
            /* append at least ES_BULK_CHUNK size */
            append_size = ES_BULK_CHUNK;
        }

Until now, it is working for 20 minutes without SEGV. (it was crashing after few minutes)
UPDATE: Running for 19h without issue

I created #4412 to follow the issue

@nokute78
Copy link
Collaborator Author

nokute78 commented Dec 9, 2021

@GuillaumeSmaha Thank you for pointing it.
Could you sent a PR?

@GuillaumeSmaha
Copy link
Contributor

@GuillaumeSmaha Thank you for pointing it. Could you sent a PR?

I will try to do a PR for it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Elastic output spends 92% of time in realloc because of linear buffer growth.
3 participants