
filter_ecs: only attach metadata keys if their values are non-empty #6614

Merged: 8 commits into fluent:1.9 on Feb 13, 2023

Conversation

@PettitWesley (Contributor) commented Dec 23, 2022

This is rebased on top of: #6589

ECS Metadata: empty metadata fix and performance optimization

Problem Statement

The ECS Filter was first released in AWS for Fluent Bit 2.29.0. The first version works, but it provides a poor experience when metadata lookups for a log fail, and the code is very inefficient.

Problem 1: Filter can attach metadata keypairs with empty values

When customers configure the ECS filter, they provide a list of metadata key-value pairs they want added to their logs. The filter looks up metadata from the ECS Agent introspection endpoint.

However, the metadata fetch can fail, for one of two reasons:

  1. ECS Agent introspection endpoint is unreachable.
  2. The filter is trying to attach metadata for a container that is not running as part of an ECS Task. The ECS Agent introspection endpoint only returns metadata for containers that are running as part of ECS Tasks. However, customers can have other containers running on their host. One key example is the ECS Agent, which is itself a container that is not part of an ECS Task. When customers collect the log files from /var/lib/docker/containers they will get logs for all containers running on that host. This is due in part to the recent Docker dual logging feature, which always writes a container's stdout/stderr to /var/lib/docker/containers. While customers can disable dual logging in the Docker daemon.json config file, it is enabled by default on the ECS optimized AMI and in all vanilla Docker installs. Thus, Fluent Bit would collect logs for the ECS Agent and any other containers that are not part of an ECS Task. When the ECS Filter tries to look up these containers in the ECS Agent introspection endpoint, it will get an error response.

When either of the above issues occur, the currently released ECS Filter will attach metadata values that are empty. For example, consider the following config.

[FILTER]
    Name              ecs
    Match             task.var.lib.docker.containers.*
    ECS_Tag_Prefix    task.var.lib.docker.containers.
    ADD               container    $ECSContainerName
    ADD               ecs_task_id  $TaskID
    ADD               ecs_family   $TaskDefinitionFamily

If the above errors occur, a customer would get a log like:

{
    "log": "log line from the container..."
    # empty metadata values
    "container": "",
    "ecs_task_id": "",
    "ecs_family": ""
}
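
With this change, keys whose values cannot be resolved are omitted entirely rather than attached with empty values, so the same failure instead produces:

{
    "log": "log line from the container..."
}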


Problem 2: Current ECS Filter is inefficient

Let N be the number of metadata key-value pairs the customer has configured. Currently, in the filter callback, the ECS Filter calls flb_ra_translate N times for every single log record it processes: https://github.com/fluent/fluent-bit/blob/v2.0.6/plugins/filter_ecs/ecs.c#L1533

Each call walks the metadata msgpack object and performs at least one flb_sds string allocation.
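
For illustration, the old hot path looks roughly like the fragment below (names such as ctx->metadata_keys, struct flb_ecs_metadata_key, and meta_map are approximations, not the exact upstream code); the point is that flb_ra_translate walks the metadata msgpack map and allocates a fresh flb_sds_t string for every configured key on every record:

/* Illustrative fragment of the old per-record hot path; variable and struct
 * names are approximations, not the exact upstream code. */
size_t off = 0;
flb_sds_t val;
struct mk_list *head;
struct flb_ecs_metadata_key *metadata_key;
msgpack_object meta_map;      /* the container's metadata map, already fetched */
msgpack_unpacked result;

msgpack_unpacked_init(&result);
while (msgpack_unpack_next(&result, data, bytes, &off) == MSGPACK_UNPACK_SUCCESS) {
    /* every one of the N configured keys is re-templated for every record */
    mk_list_foreach(head, &ctx->metadata_keys) {
        metadata_key = mk_list_entry(head, struct flb_ecs_metadata_key, _head);

        /* walks the metadata msgpack map and allocates a fresh flb_sds_t
         * string on each call */
        val = flb_ra_translate(metadata_key->ra, (char *) tag, tag_len,
                               meta_map, NULL);
        if (val) {
            /* ... pack metadata_key->key and val into the output record ... */
            flb_sds_destroy(val);
        }
    }
}
msgpack_unpacked_destroy(&result);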

Root Cause of Problems

As explained above, the root cause is the inefficient code contributed in the first version of the filter.

Solution

  1. The key-value pairs configured by the customer are static at runtime, so the filter can pre-render and store a list of key-value pairs for each container metadata buffer.
  2. The outgoing msgpack buffer returned by the filter can then be packed with the actual number of metadata keys that were successfully processed (a sketch of the resulting per-record path follows below).
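
A minimal sketch of the new per-record packing path (names are approximate and follow the fragments quoted in the review below): the record-accessor templating now happens once, when the container's metadata buffer is built, keys whose values come back empty are skipped at that point, and the filter callback only copies the pre-rendered pairs into each record:

/* Sketch of the new per-record packing path; names follow the fragments quoted
 * in the review below and are approximate. The record-accessor templating
 * happened once, when the container's metadata buffer was built, and keys
 * whose values came back empty were skipped there, so keypairs_len only
 * counts pairs that actually have values. */
size_t len;
struct mk_list *head;
struct mk_list *tmp;
struct flb_ecs_metadata_keypair *keypair;

msgpack_pack_map(&tmp_pck, map_size + metadata_buffer->keypairs_len);

/* ... re-pack the original record's key/value pairs ... */

mk_list_foreach_safe(head, tmp, &metadata_buffer->metadata_keypairs) {
    keypair = mk_list_entry(head, struct flb_ecs_metadata_keypair, _head);

    len = flb_sds_len(keypair->key);
    msgpack_pack_str(&tmp_pck, len);
    msgpack_pack_str_body(&tmp_pck, keypair->key, len);

    len = flb_sds_len(keypair->val);
    msgpack_pack_str(&tmp_pck, len);
    msgpack_pack_str_body(&tmp_pck, keypair->val, len);
}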

Estimates

Task                                       | Estimate | Planned Start date | Completion date
Metadata Filter Fixes code implementation  | 1        | 12/19/2022         | 12/20/2022
Testing                                    | 0.5      | 12/20/2022         | 12/23/2022
PR Review and Release                      | 🤷‍♂️       | 🤷‍♂️                 |

Enter [N/A] in the box if an item is not applicable to your change.

Testing
Before we can approve your change, please submit the following in a comment:

  • Example configuration file for the change
  • Debug log output from testing the change
  • Attached Valgrind output that shows no leaks or memory corruption was found

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

Documentation

  • Documentation required for this feature

Backporting

  • Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0; by submitting this pull request I understand that this code will be released under the terms of that license.

@matthewfala (Contributor) left a comment

Amazing work!
Most of the comments were for me to help keep track of my understanding of the code changes.

However, there are 2 main concerns:

  • Should the failed_metadata_requests retry tracker get reset to 0 (or deleted from the failed_metadata_requests table) when requests are successful? That way we track consecutive failures rather than total failures, which could impact very long-running tests by preventing them from getting metadata.
  • There may be an off-by-one error on agent_endpoint_retries. If it is set to 1, there may be 0 retries with the existing logic.

One question I have is:

  • Is it the code's intention for agent_endpoint_retries to ignore Cluster metadata request failures?

Just for style:

  • Could you squash the commits into 1 or 2? That will help the cherry picking
  • You may want to remove a stray blank line on Line 1619.

Comment on lines +1489 to +1490
tag, *new_val, ctx->agent_endpoint_retries);
flb_free(new_val);
@matthewfala (Contributor):

This section looks good now

 if (ret != -1) {
-    if (val >= FLB_ECS_FILTER_METADATA_RETRIES) {
+    if (*val >= ctx->agent_endpoint_retries) {
@matthewfala (Contributor):

It seems like if agent_endpoint_retries is set to 1, the ECS filter may actually only try 1 time, when it should try 2 times.

val is the number of failed metadata requests

On first failure, the failed_metadata_requests count is set to 1
See lines L1434-L1436
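
Concretely, with agent_endpoint_retries set to 1 the trace is:

/* agent_endpoint_retries = 1
 *
 * first request fails   ->  failed_metadata_requests[tag] = 1   (*val == 1)
 * next record arrives   ->  *val >= ctx->agent_endpoint_retries  is  1 >= 1  -> true,
 *                           so the filter gives up after a single attempt (0 retries)
 *
 * with '>' instead      ->  1 > 1  -> false, so one more attempt (1 retry) is made
 *                           before giving up
 */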

@PettitWesley (Contributor, Author):

ah I see, thanks, it should be >

@@ -1467,7 +1535,7 @@ static int cb_ecs_filter(const void *data, size_t bytes,
     if (check == FLB_TRUE) {
         flb_plg_debug(ctx->ins, "Failed to get ECS Metadata for tag %s %d times. "
                       "Will not attempt to retry the metadata request. Will attach cluster metadata only.",
-                      tag, FLB_ECS_FILTER_METADATA_RETRIES);
+                      tag, ctx->agent_endpoint_retries);
     }

     if (check == FLB_FALSE && ctx->cluster_metadata_only == FLB_FALSE) {
@matthewfala (Contributor):

If tag metadata retrieval does not fail, it would be nice for the retries counter to reset back to 0 or get deleted.

Currently it may be that total retries are being tracked, whereas we only want to count consecutive retries.

@matthewfala (Contributor):

Then again, maybe on success you will never try to get the metadata ever again

@matthewfala (Contributor):

But there is a TTL on the hash table, so I think there will be more retries after successful retrieval

@PettitWesley (Contributor, Author):

Yea on success we will add the container ID and its metadata to the hash table. Container IDs are static for the lifetime of a container.


/* this test is mainly for leak checking on error, not for checking result record */
expected.expected_records = 1; /* 1 record */
expected.expected_pattern = "cluster"; /* cluster key should be added */
@matthewfala (Contributor):

It appears cluster name and task name come from different metadata calls and can fail independently

@matthewfala (Contributor):

confirmed

@PettitWesley (Contributor, Author):

yup

@@ -308,6 +403,10 @@ static void flb_test_ecs_filter_cluster_error()
sleep(2);
TEST_CHECK(expected.actual_records == expected.expected_records);
filter_test_destroy(ctx);

/* unset env */
setenv("FLB_ECS_PLUGIN_UNDER_TEST", "", 1);
@matthewfala (Contributor):

should a fail pattern be added here to ensure cluster name is not present?

@matthewfala (Contributor):

Not essential

msgpack_pack_str_body(&tmp_pck,
keypair->key,
len);
len = flb_sds_len(keypair->val);
@matthewfala (Contributor):

will value always exist? I noticed that in the deletion code you check if value exists before deleting it.

@matthewfala (Contributor):

See the other comment; I suppose it does always exist. The code is fine, though I'm not sure why in some places val == NULL is checked before use and in other places it is not.

@PettitWesley (Contributor, Author):

in this case, these are pre-processed metadata keypairs, so they are guaranteed to have non-null values.

msgpack_sbuffer_destroy(&tmp_sbuf);
return FLB_FILTER_NOTOUCH;
if (metadata_buffer->keypairs_len > 0) {
mk_list_foreach_safe(head, tmp, &metadata_buffer->metadata_keypairs) {
@matthewfala (Contributor):

Great optimization, avoiding unnecessary use of the record accessor.

Comment on lines +367 to +373
if (flb_sds_len(val) == 0) {
flb_plg_info(ctx->ins, "Translation failed for %s : %s for container ID: %s",
metadata_key->key, metadata_key->template, meta->id);
flb_sds_destroy(val);
/* keep trying other keys*/
continue;
}
@matthewfala (Contributor):

I suppose value will never be null or length 0

@matthewfala (Contributor):

(this was to address the other comment asking if we need to check some val==null condition as we do in other places)

@PettitWesley (Contributor, Author):

uhhh it's null if the record accessor translation fails

@PettitWesley (Contributor, Author):

I'm not following you here.

"Number of retries for failed metadata requests to ECS Agent Introspection "
"endpoint. The most common cause of failed metadata requests is that the "
"container the metadata request was made for is not part of an ECS Task. "
"Check if you have non-task containers and docker dual logging enabled."

@matthewfala (Contributor):

Just a note. No action needed for this

}

@matthewfala (Contributor):

consider omitting blank line

@PettitWesley (Contributor, Author):

@matthewfala

Should the failed_metadata_requests retry tracker get reset to 0 (or deleted from the failed_metadata_requests table) when requests are successful? That way we track consecutive failures rather than total failures, which could impact very long-running tests by preventing them from getting metadata.

I think this isn't needed, because we only need the request for a container ID to succeed once. Container IDs never change during the container's lifetime. Once we get the metadata, we have it and will not make another request.

Of course, there is the hash table TTL, but that applies to both the failure tracking and the stored metadata. So if we have, say, one failure for a tag, and then we get the metadata, then later when the metadata entry expires, the failed-tag hash entry will also already have expired.
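
Put as pseudocode (the helper names below are hypothetical, not the filter's actual functions; both tables share the same TTL), the per-record flow is roughly:

/* Pseudocode only; the helper names below are hypothetical, not the filter's
 * actual functions. Both hash tables are created with the same TTL. */
meta = metadata_table_get(ctx, container_id);         /* filled on success      */
if (meta != NULL) {
    /* attach the cached, pre-rendered keypairs; no agent request is made */
}
else {
    failures = failed_requests_get(ctx, tag);          /* per-tag failure count  */
    if (failures > ctx->agent_endpoint_retries) {
        /* give up on task metadata; attach cluster metadata only */
    }
    else if (fetch_task_metadata(ctx, container_id, &meta) == 0) {
        metadata_table_put(ctx, container_id, meta);    /* cached until TTL expiry */
    }
    else {
        failed_requests_increment(ctx, tag);            /* counts another failure  */
    }
}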

There may be an off-by-one error on agent_endpoint_retries. If it is set to 1, there may be 0 retries with the existing logic.

Thanks, will fix it.

Is it the code's intention for agent_endpoint_retries to ignore Cluster metadata request failures?

Yes. That one should only fail if the agent endpoint is not available. Right now the retries and failure tracking are mainly there because you can have non-task containers, which will never have metadata and will always fail.

I don't see a use case for stopping the filter from retrying on cluster metadata. If there is one in the future we can add tracking for it.
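
For reference, assuming the new option is exposed in the filter configuration under the same name as the internal field (agent_endpoint_retries is an assumption here, and 2 is just an example value), it would be set like:

[FILTER]
    Name                    ecs
    Match                   task.var.lib.docker.containers.*
    ECS_Tag_Prefix          task.var.lib.docker.containers.
    agent_endpoint_retries  2
    ADD                     ecs_task_id  $TaskID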

PettitWesley added a commit to PettitWesley/fluent-bit that referenced this pull request Jan 19, 2023
@matthewfala (Contributor) left a comment

Great! Thank you for your follow-up comments; they make sense. Approving.

matthewfala pushed a commit to matthewfala/fluent-bit that referenced this pull request Feb 6, 2023
@edsiper edsiper merged commit 4ce6fb3 into fluent:1.9 Feb 13, 2023
matthewfala pushed a commit to matthewfala/fluent-bit that referenced this pull request Feb 23, 2023
PettitWesley added a commit to PettitWesley/fluent-bit that referenced this pull request Mar 13, 2023
PettitWesley added a commit to PettitWesley/fluent-bit that referenced this pull request Apr 25, 2023
PettitWesley added a commit to PettitWesley/fluent-bit that referenced this pull request May 2, 2023
PettitWesley added a commit to PettitWesley/fluent-bit that referenced this pull request Jun 2, 2023
PettitWesley added a commit to PettitWesley/fluent-bit that referenced this pull request Jun 8, 2023
matthewfala pushed a commit to matthewfala/fluent-bit that referenced this pull request Sep 23, 2023
PettitWesley added a commit to PettitWesley/fluent-bit that referenced this pull request May 22, 2024
zhihonl pushed a commit to zhihonl/fluent-bit that referenced this pull request Aug 20, 2024
swapneils pushed a commit to amazon-contributing/upstream-to-fluent-bit that referenced this pull request Oct 3, 2024