
filter_ecs: only attach metadata keys if their values are non-empty #6614

Merged: 8 commits into fluent:1.9 on Feb 13, 2023

Conversation

@PettitWesley (Contributor) commented Dec 23, 2022

This is rebased on top of: #6589

ECS Metadata: empty metadata fix and performance optimization

Problem Statement

The ECS Filter was first released in AWS for Fluent Bit 2.29.0. The first version works, but it provides a poor experience when metadata lookups for a log fail, and the code is very inefficient.

Problem 1: Filter can attach metadata keypairs with empty values

When customers configure the ECS filter, they provide a list of metadata key-value pairs they want added to their logs. The filter looks up metadata from the ECS Agent introspection endpoint.

However, the metadata fetch can fail, for one of two reasons:

  1. ECS Agent introspection endpoint is unreachable.
  2. The filter is trying to attach metadata for a container that is not running as part of an ECS Task. The ECS Agent introspection endpoint only returns metadata for containers that are running as part of ECS Tasks. However, customers can have other containers running on their host. One key example is the ECS Agent, which is itself a container that is not part of an ECS Task. When customers collect the log files from /var/lib/docker/containers they will get logs for all containers running on that host. This is due in part to the recent Docker dual logging feature, which always writes a container's stdout/stderr to /var/lib/docker/containers. While customers can disable dual logging in the Docker daemon.json config file, it is enabled by default on the ECS optimized AMI and in all vanilla Docker installs. Thus, Fluent Bit would collect logs for the ECS Agent and any other containers that are not part of an ECS Task. When the ECS Filter tries to look up these containers in the ECS Agent introspection endpoint, it will get an error response.

When either of the above issues occur, the currently released ECS Filter will attach metadata values that are empty. For example, consider the following config.

[FILTER]
    Name              ecs
    Match             task.var.lib.docker.containers.*
    ECS_Tag_Prefix    task.var.lib.docker.containers.
    ADD               container    $ECSContainerName
    ADD               ecs_task_id  $TaskID
    ADD               ecs_family   $TaskDefinitionFamily

If the above errors occur, a customer would get a log like:

{
    "log": "log line from the container..."
    # empty metadata values
    "container": "",
    "ecs_task_id": "",
    "ecs_family": ""
}
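
With this change, keys whose values cannot be resolved are omitted entirely rather than attached with empty values, so the same failure instead produces:

{
    "log": "log line from the container..."
}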


Problem 2: Current ECS Filter is inefficient

Let N be the number of metadata key-value pairs the customer has configured. Currently, in the filter callback, the ECS Filter calls flb_ra_translate N times for every single log record it processes: https://github.com/fluent/fluent-bit/blob/v2.0.6/plugins/filter_ecs/ecs.c#L1533

Each call walks the metadata msgpack object and performs at least one flb_sds string allocation.
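
For illustration, the old hot path looks roughly like the fragment below (names such as ctx->metadata_keys, struct flb_ecs_metadata_key, and meta_map are approximations, not the exact upstream code); the point is that flb_ra_translate walks the metadata msgpack map and allocates a fresh flb_sds_t string for every configured key on every record:

/* Illustrative fragment of the old per-record hot path; variable and struct
 * names are approximations, not the exact upstream code. */
size_t off = 0;
flb_sds_t val;
struct mk_list *head;
struct flb_ecs_metadata_key *metadata_key;
msgpack_object meta_map;      /* the container's metadata map, already fetched */
msgpack_unpacked result;

msgpack_unpacked_init(&result);
while (msgpack_unpack_next(&result, data, bytes, &off) == MSGPACK_UNPACK_SUCCESS) {
    /* every one of the N configured keys is re-templated for every record */
    mk_list_foreach(head, &ctx->metadata_keys) {
        metadata_key = mk_list_entry(head, struct flb_ecs_metadata_key, _head);

        /* walks the metadata msgpack map and allocates a fresh flb_sds_t
         * string on each call */
        val = flb_ra_translate(metadata_key->ra, (char *) tag, tag_len,
                               meta_map, NULL);
        if (val) {
            /* ... pack metadata_key->key and val into the output record ... */
            flb_sds_destroy(val);
        }
    }
}
msgpack_unpacked_destroy(&result);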

Root Cause of Problems

As explained above, the root cause is the inefficient code contributed in the first version of the filter.

Solution

  1. The key-value pairs configured by the customer are static at runtime, so the filter can pre-render and store a list of key-value pairs for each container metadata buffer.
  2. The outgoing msgpack buffer returned by the filter can then be packed with the actual number of metadata keys that were successfully processed (a sketch of the resulting per-record path follows below).
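
A minimal sketch of the new per-record packing path (names are approximate and follow the fragments quoted in the review below): the record-accessor templating now happens once, when the container's metadata buffer is built, keys whose values come back empty are skipped at that point, and the filter callback only copies the pre-rendered pairs into each record:

/* Sketch of the new per-record packing path; names follow the fragments quoted
 * in the review below and are approximate. The record-accessor templating
 * happened once, when the container's metadata buffer was built, and keys
 * whose values came back empty were skipped there, so keypairs_len only
 * counts pairs that actually have values. */
size_t len;
struct mk_list *head;
struct mk_list *tmp;
struct flb_ecs_metadata_keypair *keypair;

msgpack_pack_map(&tmp_pck, map_size + metadata_buffer->keypairs_len);

/* ... re-pack the original record's key/value pairs ... */

mk_list_foreach_safe(head, tmp, &metadata_buffer->metadata_keypairs) {
    keypair = mk_list_entry(head, struct flb_ecs_metadata_keypair, _head);

    len = flb_sds_len(keypair->key);
    msgpack_pack_str(&tmp_pck, len);
    msgpack_pack_str_body(&tmp_pck, keypair->key, len);

    len = flb_sds_len(keypair->val);
    msgpack_pack_str(&tmp_pck, len);
    msgpack_pack_str_body(&tmp_pck, keypair->val, len);
}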

Estimates

Task                                       | Estimate | Planned Start date | Completion date
Metadata Filter Fixes code implementation  | 1        | 12/19/2022         | 12/20/2022
Testing                                    | 0.5      | 12/20/2022         | 12/23/2022
PR Review and Release                      | 🤷‍♂️       | 🤷‍♂️                 |

Enter [N/A] in the box if an item is not applicable to your change.

Testing
Before we can approve your change, please submit the following in a comment:

  • Example configuration file for the change
  • Debug log output from testing the change
  • Attached Valgrind output that shows no leaks or memory corruption was found

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

Documentation

  • Documentation required for this feature

Backporting

  • Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0; by submitting this pull request I understand that this code will be released under the terms of that license.

@matthewfala (Contributor) left a comment

Amazing work!
Most of the comments were for me to help keep track of my understanding of the code changes.

However, there are 2 main concerns:

  • Should the failed_metadata_requests retry tracker get reset to 0 (or deleted from the failed_metadata_requests table) when requests are successful? That way we track consecutive failures rather than total failures, which could impact very long-running tests by preventing them from getting metadata.
  • There may be an off-by-one error on agent_endpoint_retries. If it is set to 1, there may be 0 retries with the existing logic.

One question I have is:

  • Is it the code's intention for agent_endpoint_retries to ignore Cluster metadata request failures?

Just for style:

  • Could you squash the commits into 1 or 2? That will help the cherry picking
  • You may want to remove a stray blank line on Line 1619.

Comment on lines +1489 to +1490
tag, *new_val, ctx->agent_endpoint_retries);
flb_free(new_val);
@matthewfala (Contributor):

This section looks good now

 if (ret != -1) {
-    if (val >= FLB_ECS_FILTER_METADATA_RETRIES) {
+    if (*val >= ctx->agent_endpoint_retries) {
@matthewfala (Contributor):

It seems like if agent_endpoint_retries is set to 1, the ECS filter may actually only try 1 time, when it should try 2 times.

val is the number of failed metadata requests

On first failure, the failed_metadata_requests count is set to 1
See lines L1434-L1436
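
Concretely, with agent_endpoint_retries set to 1 the trace is:

/* agent_endpoint_retries = 1
 *
 * first request fails   ->  failed_metadata_requests[tag] = 1   (*val == 1)
 * next record arrives   ->  *val >= ctx->agent_endpoint_retries  is  1 >= 1  -> true,
 *                           so the filter gives up after a single attempt (0 retries)
 *
 * with '>' instead      ->  1 > 1  -> false, so one more attempt (1 retry) is made
 *                           before giving up
 */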

@PettitWesley (Contributor, Author):

ah I see, thanks, it should be >

@@ -1467,7 +1535,7 @@ static int cb_ecs_filter(const void *data, size_t bytes,
     if (check == FLB_TRUE) {
         flb_plg_debug(ctx->ins, "Failed to get ECS Metadata for tag %s %d times. "
                       "Will not attempt to retry the metadata request. Will attach cluster metadata only.",
-                      tag, FLB_ECS_FILTER_METADATA_RETRIES);
+                      tag, ctx->agent_endpoint_retries);
     }

     if (check == FLB_FALSE && ctx->cluster_metadata_only == FLB_FALSE) {
@matthewfala (Contributor):

If tag metadata retrieval does not fail, it would be nice for the retries counter to reset back to 0 or get deleted.

Currently it may be that total retries are being tracked, whereas we only want to count consecutive retries.

@matthewfala (Contributor):

Then again, maybe on success you will never try to get the metadata ever again

@matthewfala (Contributor):

But there is a TTL on the hash table, so I think there will be more retries after successful retrieval

@PettitWesley (Contributor, Author):

Yea on success we will add the container ID and its metadata to the hash table. Container IDs are static for the lifetime of a container.


/* this test is mainly for leak checking on error, not for checking result record */
expected.expected_records = 1; /* 1 record */
expected.expected_pattern = "cluster"; /* cluster key should be added */
@matthewfala (Contributor):

It appears cluster name and task name come from different metadata calls and can fail independently

@matthewfala (Contributor):

confirmed

@PettitWesley (Contributor, Author):

yup

@@ -308,6 +403,10 @@ static void flb_test_ecs_filter_cluster_error()
sleep(2);
TEST_CHECK(expected.actual_records == expected.expected_records);
filter_test_destroy(ctx);

/* unset env */
setenv("FLB_ECS_PLUGIN_UNDER_TEST", "", 1);
@matthewfala (Contributor):

should a fail pattern be added here to ensure cluster name is not present?

@matthewfala (Contributor):

Not essential

msgpack_pack_str_body(&tmp_pck,
keypair->key,
len);
len = flb_sds_len(keypair->val);
@matthewfala (Contributor):

will value always exist? I noticed that in the deletion code you check if value exists before deleting it.

@matthewfala (Contributor):

See the other comment; I suppose it does always exist. The code is fine, though I'm not sure why in some places val == NULL is checked before use and in other places it is not.

@PettitWesley (Contributor, Author):

in this case, these are pre-processed metadata keypairs, so they are guaranteed to have non-null values.

msgpack_sbuffer_destroy(&tmp_sbuf);
return FLB_FILTER_NOTOUCH;
if (metadata_buffer->keypairs_len > 0) {
mk_list_foreach_safe(head, tmp, &metadata_buffer->metadata_keypairs) {
@matthewfala (Contributor):

Great optimization, avoiding unnecessary use of the record accessor.

Comment on lines +367 to +373
if (flb_sds_len(val) == 0) {
flb_plg_info(ctx->ins, "Translation failed for %s : %s for container ID: %s",
metadata_key->key, metadata_key->template, meta->id);
flb_sds_destroy(val);
/* keep trying other keys*/
continue;
}
@matthewfala (Contributor):

I suppose value will never be null or length 0

@matthewfala (Contributor):

(this was to address the other comment asking if we need to check some val==null condition as we do in other places)

@PettitWesley (Contributor, Author):

uhhh it's null if the record accessor translation fails

@PettitWesley (Contributor, Author):

I'm not following you here.

"Number of retries for failed metadata requests to ECS Agent Introspection "
"endpoint. The most common cause of failed metadata requests is that the "
"container the metadata request was made for is not part of an ECS Task. "
"Check if you have non-task containers and docker dual logging enabled."

@matthewfala (Contributor):

Just a note. No action needed for this

}

@matthewfala (Contributor):

consider omitting blank line

@PettitWesley (Contributor, Author):

@matthewfala

Should the failed_metadata_requests retry tracker get reset to 0 (or deleted from the failed_metadata_requests table) when requests are successful? That way we track consecutive failures rather than total failures, which could impact very long-running tests by preventing them from getting metadata.

I think this isn't needed, because we only need the request for a container ID to succeed once. Container IDs never change during the container's lifetime. Once we get the metadata, we have it and will not make another request.

Of course, there is the hash table TTL, but that applies to both the failure tracking and the stored metadata. So if we have, say, one failure for a tag, and then we get the metadata, then later when the metadata entry expires, the failed-tag hash entry will also already have expired.
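
Put as pseudocode (the helper names below are hypothetical, not the filter's actual functions; both tables share the same TTL), the per-record flow is roughly:

/* Pseudocode only; the helper names below are hypothetical, not the filter's
 * actual functions. Both hash tables are created with the same TTL. */
meta = metadata_table_get(ctx, container_id);         /* filled on success      */
if (meta != NULL) {
    /* attach the cached, pre-rendered keypairs; no agent request is made */
}
else {
    failures = failed_requests_get(ctx, tag);          /* per-tag failure count  */
    if (failures > ctx->agent_endpoint_retries) {
        /* give up on task metadata; attach cluster metadata only */
    }
    else if (fetch_task_metadata(ctx, container_id, &meta) == 0) {
        metadata_table_put(ctx, container_id, meta);    /* cached until TTL expiry */
    }
    else {
        failed_requests_increment(ctx, tag);            /* counts another failure  */
    }
}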

There may be an off-by-one error on agent_endpoint_retries. If it is set to 1, there may be 0 retries with the existing logic.

Thanks, will fix it.

Is it the code's intention for agent_endpoint_retries to ignore Cluster metadata request failures?

Yes. That one should only fail if the agent endpoint is not available. Right now the retries and failure tracking are mainly there because you can have non-task containers, which will never have metadata and will always fail.

I don't see a use case for stopping the filter from retrying on cluster metadata. If there is one in the future we can add tracking for it.
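
For reference, assuming the new option is exposed in the filter configuration under the same name as the internal field (agent_endpoint_retries is an assumption here, and 2 is just an example value), it would be set like:

[FILTER]
    Name                    ecs
    Match                   task.var.lib.docker.containers.*
    ECS_Tag_Prefix          task.var.lib.docker.containers.
    agent_endpoint_retries  2
    ADD                     ecs_task_id  $TaskID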

PettitWesley added a commit to PettitWesley/fluent-bit that referenced this pull request Jan 19, 2023
@matthewfala (Contributor) left a comment

Great! Thank you for your follow-up comments; they make sense. Approving.

matthewfala pushed a commit to matthewfala/fluent-bit that referenced this pull request Feb 6, 2023
@edsiper edsiper merged commit 4ce6fb3 into fluent:1.9 Feb 13, 2023
matthewfala pushed a commit to matthewfala/fluent-bit that referenced this pull request Feb 23, 2023
PettitWesley added a commit to PettitWesley/fluent-bit that referenced this pull request Mar 13, 2023
PettitWesley added a commit to PettitWesley/fluent-bit that referenced this pull request Apr 25, 2023
PettitWesley added a commit to PettitWesley/fluent-bit that referenced this pull request May 2, 2023
PettitWesley added a commit to PettitWesley/fluent-bit that referenced this pull request Jun 2, 2023
PettitWesley added a commit to PettitWesley/fluent-bit that referenced this pull request Jun 8, 2023
matthewfala pushed a commit to matthewfala/fluent-bit that referenced this pull request Sep 23, 2023
PettitWesley added a commit to PettitWesley/fluent-bit that referenced this pull request May 22, 2024
zhihonl pushed a commit to zhihonl/fluent-bit that referenced this pull request Aug 20, 2024
swapneils pushed a commit to amazon-contributing/upstream-to-fluent-bit that referenced this pull request Oct 3, 2024