in_calyptia_fleet: fix memory leaks during hot-reload of current and new configurations. #8133

Merged: 5 commits merged into master from pwhelan-fleet-reload-leak on Nov 4, 2023


@pwhelan pwhelan commented Nov 3, 2023

Summary

I have identified and fixed several memory leaks in the fleet input plugin that occur when reloading the current configuration (if one exists from a previous run) or any new configuration loaded from the Calyptia API. The leaks came from:

  • Initialization of plugin instances when testing the new configuration in test_config_is_valid.
  • Memory leaked due to premature return (FLB_OUTPUT_RETURN) in the collect callback.
  • File paths (flb_sds_t) leaked during error handling.
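
The second and third items share one pattern: a function allocates a string, hits an error, and returns before its cleanup code runs. A minimal sketch of that pattern and of a single-exit fix follows. It is not the plugin's code: `tracked_strdup`, `tracked_free`, `load_config_leaky`, `load_config_fixed`, and the path literal are hypothetical stand-ins, and a plain `char *` takes the place of `flb_sds_t`.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical counter that makes a leak observable without Valgrind. */
static int live_allocs = 0;

/* Hypothetical stand-in for creating a heap-allocated path string. */
static char *tracked_strdup(const char *s)
{
    size_t len = strlen(s) + 1;
    char *copy = malloc(len);

    if (copy != NULL) {
        memcpy(copy, s, len);
        live_allocs++;
    }
    return copy;
}

/* Hypothetical stand-in for destroying the path string; NULL is ignored. */
static void tracked_free(char *s)
{
    if (s != NULL) {
        assert(live_allocs > 0);
        free(s);
        live_allocs--;
    }
}

/* Leaky shape: the early return on error skips the free below it. */
static int load_config_leaky(int simulate_error)
{
    char *cfg_path = tracked_strdup("/tmp/example/new.ini");

    if (cfg_path == NULL) {
        return -1;
    }
    if (simulate_error) {
        return -1;              /* cfg_path is lost here */
    }
    tracked_free(cfg_path);
    return 0;
}

/* Fixed shape: every exit after the allocation goes through one cleanup label. */
static int load_config_fixed(int simulate_error)
{
    int ret = -1;
    char *cfg_path = tracked_strdup("/tmp/example/new.ini");

    if (cfg_path == NULL) {
        return -1;
    }
    if (simulate_error) {
        goto cleanup;
    }
    ret = 0;

cleanup:
    tracked_free(cfg_path);
    return ret;
}
```

Calling `load_config_leaky(1)` leaves `live_allocs` at 1, while `load_config_fixed(1)` returns the same error code and leaves it at 0. A macro that returns from the middle of a callback behaves like the early `return` in the leaky version, so any cleanup has to happen before it is invoked.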

The function test_config_is_valid was doing more than it needed to by instantiating the configured plugins. Freeing the memory allocated by that initialization is problematic because flb_output_exit also frees the TLS variables it uses to store co-routine parameters, which makes it not thread-safe. This PR therefore removes the plugin initialization from the test.
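
One general property of thread-local storage may help explain why such cleanup is sensitive to the calling thread: each thread has its own copy of a TLS variable. The sketch below shows only that property, using C11 `_Thread_local` and POSIX threads. It is an illustration under assumptions, not Fluent Bit's mechanism: `params`, `worker_reads_slot`, and `slot_is_per_thread` are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical per-thread slot, standing in for co-routine parameters. */
static _Thread_local const char *params = NULL;

/* Runs in a new thread and reports what that thread sees in its own slot. */
static void *worker_reads_slot(void *arg)
{
    const char **seen = arg;

    assert(seen != NULL);
    *seen = params;     /* the new thread's copy, still NULL */
    return NULL;
}

/*
 * Returns 1 if a value stored by the calling thread is invisible to another
 * thread, 0 if it is visible, and -1 if the thread could not be created.
 */
static int slot_is_per_thread(void)
{
    const char *seen_by_worker = "unset";
    pthread_t tid;

    params = "set by the calling thread";
    if (pthread_create(&tid, NULL, worker_reads_slot, &seen_by_worker) != 0) {
        return -1;
    }
    pthread_join(tid, NULL);

    return seen_by_worker == NULL && params != NULL;
}
```

Because the slot is per-thread, a routine that releases what the slot points to operates on the copy of whichever thread calls it, so its effect depends on the caller.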


Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change, please submit the following in a comment:

  • Example configuration file for the change
  • Debug log output from testing the change
  • Attached Valgrind output that shows no leaks or memory corruption was found

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • Run local packaging test showing all targets (including any new ones) build.
  • Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

  • Documentation required for this feature

Backporting

  • Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0. By submitting this pull request, I understand that this code will be released under the terms of that license.

@pwhelan
Contributor Author

pwhelan commented Nov 3, 2023

Here is a valgrind run:

valgrind --leak-check=full ./bin/fluent-bit -c fleet.conf
==150479== Memcheck, a memory error detector
==150479== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==150479== Using Valgrind-3.21.0 and LibVEX; rerun with -h for copyright info
==150479== Command: ./bin/fluent-bit -c fleet.conf
==150479==
Fluent Bit v2.2.0
* Copyright (C) 2015-2023 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2023/11/03 11:10:24] [ info] [fluent bit] version=2.2.0, commit=0ae1dba373, pid=150479
[2023/11/03 11:10:24] [ info] [custom:calyptia:calyptia.0] custom initialized!
[2023/11/03 11:10:24] [ info] [storage] ver=1.5.1, type=memory, sync=normal, checksum=off, max_chunks_up=128
[2023/11/03 11:10:24] [ info] [cmetrics] version=0.6.4
[2023/11/03 11:10:24] [ info] [ctraces ] version=0.3.1
[2023/11/03 11:10:24] [ info] [input:fluentbit_metrics:fluentbit_metrics.0] initializing
[2023/11/03 11:10:24] [ info] [input:fluentbit_metrics:fluentbit_metrics.0] storage_strategy='memory' (memory only)
[2023/11/03 11:10:25] [ info] [input:calyptia_fleet:calyptia_fleet.1] initializing
[2023/11/03 11:10:25] [ info] [input:calyptia_fleet:calyptia_fleet.1] storage_strategy='memory' (memory only)
[2023/11/03 11:10:25] [ info] [input:calyptia_fleet:calyptia_fleet.1] initializing calyptia fleet input.
[2023/11/03 11:10:25] [ info] [input:calyptia_fleet:calyptia_fleet.1] loading configuration from /tmp/calyptia-fleet/99d70f0171de7aec9b8ae68b1f6e32b7dd65763b0b73dd603ed95a112b0bf8ee/fleet-connect/new.ini.
[2023/11/03 11:10:25] [ info] [sp] stream processor started
[2023/11/03 11:10:30] [engine] caught signal (SIGHUP)
[2023/11/03 11:10:30] [ info] reloading instance pid=150479 tid=0x5a1c8c0
[2023/11/03 11:10:30] [ info] [reload] stop everything of the old context
[2023/11/03 11:10:30] [ warn] [engine] service will shutdown when all remaining tasks are flushed
[2023/11/03 11:10:30] [ info] [input] pausing fluentbit_metrics.0
[2023/11/03 11:10:30] [ info] [input] pausing calyptia_fleet.1
[2023/11/03 11:10:30] [ info] [engine] service has stopped (0 pending tasks)
[2023/11/03 11:10:30] [ info] [input] pausing fluentbit_metrics.0
[2023/11/03 11:10:30] [ info] [input] pausing calyptia_fleet.1
[2023/11/03 11:10:31] [ info] [reload] start everything
[2023/11/03 11:10:31] [ info] [fluent bit] version=2.2.0, commit=0ae1dba373, pid=150479
[2023/11/03 11:10:31] [ info] [custom:calyptia:calyptia.0] custom initialized!
[2023/11/03 11:10:31] [ info] [storage] ver=1.5.1, type=memory, sync=normal, checksum=off, max_chunks_up=128
[2023/11/03 11:10:31] [ info] [cmetrics] version=0.6.4
[2023/11/03 11:10:31] [ info] [output:stdout:stdout.0] worker #0 started
[2023/11/03 11:10:31] [ info] [ctraces ] version=0.3.1
[2023/11/03 11:10:31] [ info] [input:dummy:dummy.0] initializing
[2023/11/03 11:10:31] [ info] [input:dummy:dummy.0] storage_strategy='memory' (memory only)
[2023/11/03 11:10:31] [ info] [input:fluentbit_metrics:fluentbit_metrics.1] initializing
[2023/11/03 11:10:31] [ info] [input:fluentbit_metrics:fluentbit_metrics.1] storage_strategy='memory' (memory only)
[2023/11/03 11:10:31] [ info] [input:calyptia_fleet:calyptia_fleet.2] initializing
[2023/11/03 11:10:31] [ info] [input:calyptia_fleet:calyptia_fleet.2] storage_strategy='memory' (memory only)
[2023/11/03 11:10:31] [ info] [input:calyptia_fleet:calyptia_fleet.2] initializing calyptia fleet input.
[2023/11/03 11:10:33] [ info] [output:calyptia:calyptia.1] connected to Calyptia, agent_id='39cb82dd-9bdf-4e78-a540-117124e775e9'
[2023/11/03 11:10:33] [ info] [sp] stream processor started
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1699020634.000109267, {}], {"message"=>"dummy"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1699020635.001949711, {}], {"message"=>"dummy"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1699020635.997304063, {}], {"message"=>"dummy"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1699020636.996986061, {}], {"message"=>"dummy"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1699020637.998664067, {}], {"message"=>"dummy"}]
==150479== Thread 2 flb-pipeline:
==150479== Conditional jump or move depends on uninitialised value(s)
==150479==    at 0x21B26E: output_pre_cb_flush (flb_output.h:584)
==150479==    by 0xE53826: co_init (amd64.c:117)
==150479==
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1699020639.002344548, {}], {"message"=>"dummy"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1699020639.997928199, {}], {"message"=>"dummy"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1699020640.998397962, {}], {"message"=>"dummy"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1699020641.996954959, {}], {"message"=>"dummy"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1699020642.997545603, {}], {"message"=>"dummy"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1699020643.997002975, {}], {"message"=>"dummy"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1699020644.997224119, {}], {"message"=>"dummy"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1699020645.997260648, {}], {"message"=>"dummy"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1699020646.997007810, {}], {"message"=>"dummy"}]
[2023/11/03 11:10:48] [ info] [input:calyptia_fleet:calyptia_fleet.2] loading configuration from /tmp/calyptia-fleet/99d70f0171de7aec9b8ae68b1f6e32b7dd65763b0b73dd603ed95a112b0bf8ee/fleet-connect/1698991946.ini.
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1699020648.057333993, {}], {"message"=>"dummy"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1699020648.997530864, {}], {"message"=>"dummy"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1699020649.997006075, {}], {"message"=>"dummy"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1699020650.997347887, {}], {"message"=>"dummy"}]
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1699020651.996956881, {}], {"message"=>"dummy"}]
[2023/11/03 11:10:53] [engine] caught signal (SIGHUP)
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1699020652.997226674, {}], {"message"=>"dummy"}]
[2023/11/03 11:10:54] [ info] reloading instance pid=150479 tid=0x5a1c8c0
[2023/11/03 11:10:54] [ info] [reload] stop everything of the old context
[2023/11/03 11:10:54] [ warn] [engine] service will shutdown when all remaining tasks are flushed
[0] dummy.4aaf0fd6-0f0c-47a3-92b1-7432382f5f3f: [[1699020653.997020220, {}], {"message"=>"dummy"}]
[2023/11/03 11:10:54] [ info] [input] pausing dummy.0
[2023/11/03 11:10:54] [ info] [input] pausing fluentbit_metrics.1
[2023/11/03 11:10:54] [ info] [input] pausing calyptia_fleet.2
[2023/11/03 11:10:54] [ info] [engine] service has stopped (0 pending tasks)
[2023/11/03 11:10:54] [ info] [input] pausing dummy.0
[2023/11/03 11:10:54] [ info] [input] pausing fluentbit_metrics.1
[2023/11/03 11:10:54] [ info] [input] pausing calyptia_fleet.2
[2023/11/03 11:10:54] [ info] [output:stdout:stdout.0] thread worker #0 stopping...
[2023/11/03 11:10:54] [ info] [output:stdout:stdout.0] thread worker #0 stopped
[2023/11/03 11:10:55] [ info] [reload] start everything
[2023/11/03 11:10:55] [ info] [fluent bit] version=2.2.0, commit=0ae1dba373, pid=150479
[2023/11/03 11:10:55] [ info] [custom:calyptia:calyptia.0] custom initialized!
[2023/11/03 11:10:55] [ info] [storage] ver=1.5.1, type=memory, sync=normal, checksum=off, max_chunks_up=128
[2023/11/03 11:10:55] [ info] [cmetrics] version=0.6.4
[2023/11/03 11:10:55] [ info] [output:stdout:stdout.0] worker #0 started
[2023/11/03 11:10:55] [ info] [ctraces ] version=0.3.1
[2023/11/03 11:10:55] [ info] [input:dummy:dummy.0] initializing
[2023/11/03 11:10:55] [ info] [input:dummy:dummy.0] storage_strategy='memory' (memory only)
[2023/11/03 11:10:55] [ info] [input:fluentbit_metrics:fluentbit_metrics.1] initializing
[2023/11/03 11:10:55] [ info] [input:fluentbit_metrics:fluentbit_metrics.1] storage_strategy='memory' (memory only)
[2023/11/03 11:10:55] [ info] [input:calyptia_fleet:calyptia_fleet.2] initializing
[2023/11/03 11:10:55] [ info] [input:calyptia_fleet:calyptia_fleet.2] storage_strategy='memory' (memory only)
[2023/11/03 11:10:55] [ info] [input:calyptia_fleet:calyptia_fleet.2] initializing calyptia fleet input.
^C[2023/11/03 11:10:55] [ info] [output:calyptia:calyptia.1] connected to Calyptia, agent_id='39cb82dd-9bdf-4e78-a540-117124e775e9'
[2023/11/03 11:10:55] [engine] caught signal ([2023/11/03 11:10:55] [ info] [sp] stream processor started
SIGINT)
[2023/11/03 11:10:55] [ warn] [engine] service will shutdown in max 5 seconds
[2023/11/03 11:10:55] [ info] [input] pausing dummy.0
[2023/11/03 11:10:55] [ info] [input] pausing fluentbit_metrics.1
[2023/11/03 11:10:55] [ info] [input] pausing calyptia_fleet.2
[2023/11/03 11:10:55] [ info] [engine] service has stopped (0 pending tasks)
[2023/11/03 11:10:55] [ info] [input] pausing dummy.0
[2023/11/03 11:10:55] [ info] [input] pausing fluentbit_metrics.1
[2023/11/03 11:10:55] [ info] [input] pausing calyptia_fleet.2
[2023/11/03 11:10:55] [ info] [output:stdout:stdout.0] thread worker #0 stopping...
[2023/11/03 11:10:55] [ info] [output:stdout:stdout.0] thread worker #0 stopped
==150479==
==150479== HEAP SUMMARY:
==150479==     in use at exit: 16,282 bytes in 27 blocks
==150479==   total heap usage: 239,052 allocs, 239,025 frees, 31,304,544 bytes allocated
==150479==
==150479== Thread 1:
==150479== 27 bytes in 1 blocks are definitely lost in loss record 3 of 19
==150479==    at 0x4841848: malloc (vg_replace_malloc.c:431)
==150479==    by 0x1F21FD: flb_malloc (flb_mem.h:80)
==150479==    by 0x1F23DC: sds_alloc (flb_sds.c:41)
==150479==    by 0x1F245F: flb_sds_create_len (flb_sds.c:62)
==150479==    by 0x1F2501: flb_sds_create (flb_sds.c:88)
==150479==    by 0x1CB269: flb_service_conf_path_set (fluent-bit.c:736)
==150479==    by 0x1CB2DC: service_configure (fluent-bit.c:762)
==150479==    by 0x1CC441: flb_main (fluent-bit.c:1298)
==150479==    by 0x1CC75C: main (fluent-bit.c:1439)
==150479==
==150479== 123 bytes in 1 blocks are definitely lost in loss record 12 of 19
==150479==    at 0x4841848: malloc (vg_replace_malloc.c:431)
==150479==    by 0x1F21FD: flb_malloc (flb_mem.h:80)
==150479==    by 0x1F23DC: sds_alloc (flb_sds.c:41)
==150479==    by 0x1F245F: flb_sds_create_len (flb_sds.c:62)
==150479==    by 0x1F2501: flb_sds_create (flb_sds.c:88)
==150479==    by 0x290657: flb_reload (flb_reload.c:400)
==150479==    by 0x1CC621: flb_main (fluent-bit.c:1397)
==150479==    by 0x1CC75C: main (fluent-bit.c:1439)
==150479==
==150479== 10,927 (216 direct, 10,711 indirect) bytes in 1 blocks are definitely lost in loss record 19 of 19
==150479==    at 0x48469B3: calloc (vg_replace_malloc.c:1554)
==150479==    by 0x30C310: flb_calloc (flb_mem.h:95)
==150479==    by 0x31121C: flb_http_client (flb_http_client.c:713)
==150479==    by 0x4BBB2B: in_calyptia_fleet_collect (in_calyptia_fleet.c:783)
==150479==    by 0x1FF65A: input_pre_cb_collect (flb_input.h:517)
==150479==    by 0xE53826: co_init (amd64.c:117)
==150479==
==150479== LEAK SUMMARY:
==150479==    definitely lost: 366 bytes in 3 blocks
==150479==    indirectly lost: 10,711 bytes in 12 blocks
==150479==      possibly lost: 0 bytes in 0 blocks
==150479==    still reachable: 5,205 bytes in 12 blocks
==150479==         suppressed: 0 bytes in 0 blocks
==150479== Reachable blocks (those to which a pointer was found) are not shown.
==150479== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==150479==
==150479== Use --track-origins=yes to see where uninitialised values come from
==150479== For lists of detected and suppressed errors, rerun with: -s
==150479== ERROR SUMMARY: 4 errors from 4 contexts (suppressed: 0 from 0)

There are still some memory leaks which are out of scope for this PR:

  • service_configure (fluent-bit.c)
  • flb_reload (flb_reload.c)

I will attempt to address these in another pull request.
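
As a reading aid, the heap figures in the Valgrind run above can be cross-checked against each other. The sketch below copies the numbers from that output into a hypothetical `heap_report` struct and checks that the totals agree; the struct and `report_is_consistent` are illustrative names, not part of Valgrind or Fluent Bit.

```c
#include <assert.h>
#include <stddef.h>

/* Figures copied from the HEAP SUMMARY and LEAK SUMMARY above. */
struct heap_report {
    long allocs;
    long frees;
    long bytes_at_exit;
    long blocks_at_exit;
    long definitely_lost_bytes;
    long definitely_lost_blocks;
    long indirectly_lost_bytes;
    long indirectly_lost_blocks;
    long possibly_lost_bytes;
    long possibly_lost_blocks;
    long reachable_bytes;
    long reachable_blocks;
};

static const struct heap_report report = {
    239052, 239025,     /* allocs, frees */
    16282, 27,          /* in use at exit: bytes, blocks */
    366, 3,             /* definitely lost */
    10711, 12,          /* indirectly lost */
    0, 0,               /* possibly lost */
    5205, 12            /* still reachable */
};

/* Returns 1 when the per-category figures add up to the at-exit totals. */
static int report_is_consistent(const struct heap_report *r)
{
    long bytes;
    long blocks;

    assert(r != NULL);
    bytes = r->definitely_lost_bytes + r->indirectly_lost_bytes +
            r->possibly_lost_bytes + r->reachable_bytes;
    blocks = r->definitely_lost_blocks + r->indirectly_lost_blocks +
             r->possibly_lost_blocks + r->reachable_blocks;

    return (r->allocs - r->frees) == r->blocks_at_exit &&
           bytes == r->bytes_at_exit &&
           blocks == r->blocks_at_exit;
}
```

The three "definitely lost" records (27, 123, and 216 direct bytes) also add up to the 366 bytes in the leak summary, so the listed stack traces account for every definite leak in this run.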

Free memory during error handling when attempting to load new configurations.

Remove code that initializes configurations during testing since it is
impossible to free that memory later without crashing fluent-bit. This is due
to the use of TLS variables for the co-routine parameters for outputs and
how they are freed in flb_output_exit.

Signed-off-by: Phillip Whelan <[email protected]>
@pwhelan pwhelan force-pushed the pwhelan-fleet-reload-leak branch from 611f94f to 6d9d9be Compare November 3, 2023 15:47
@edsiper edsiper merged commit 8207c71 into master Nov 4, 2023
42 checks passed
@edsiper edsiper deleted the pwhelan-fleet-reload-leak branch November 4, 2023 22:15
2 participants