filter_ecs: only attach metadata keys if their values are non-empty #6614
Conversation
Signed-off-by: Wesley Pettit <[email protected]>
Amazing work!
Most of the comments were for me to help keep track of my understanding of the code changes.
However, there are two main concerns:
- Should the failed_metadata_requests hash table retry tracker get reset to 0 (or have its entry deleted from the failed_metadata_requests table) when requests are successful? That way we would track consecutive failures rather than total failures, which could otherwise prevent very long-running tasks from ever getting metadata.
- There may be an off-by-one error in agent_endpoint_retries: if it is set to 1, there may be 0 retries with the existing logic.
One question I have is:
- Is it the code's intention for agent_endpoint_retries to ignore Cluster metadata request failures?
Just for style:
- Could you squash the commits into 1 or 2? That will help with cherry-picking.
- You may want to remove a stray blank line on Line 1619.
tag, *new_val, ctx->agent_endpoint_retries);
flb_free(new_val);
This section looks good now
plugins/filter_ecs/ecs.c
    if (ret != -1) {
-        if (val >= FLB_ECS_FILTER_METADATA_RETRIES) {
+        if (*val >= ctx->agent_endpoint_retries) {
It seems like if agent_endpoint_retries is set to 1, the ECS filter may actually only try once, where it should try twice.
val is the number of failed metadata requests.
On first failure, the failed_metadata_requests count is set to 1.
See lines L1434-L1436.
ah I see, thanks, it should be >
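For concreteness, a tiny standalone sketch (illustrative names, not the plugin's actual code) of why > rather than >= gives the intended attempt count when the counter is set to 1 on the first failure:

#include <stdio.h>

/* failed is the failed_metadata_requests count, set to 1 on the first
 * failure. With >=, retries=1 stops after the initial attempt (0 retries);
 * with >, retries=1 allows the initial attempt plus one retry. */
static int give_up(int failed, int retries)
{
    return failed > retries; /* the agreed fix: > instead of >= */
}

int main(void)
{
    printf("%d\n", give_up(1, 1)); /* 0: one failure so far, retry once more */
    printf("%d\n", give_up(2, 1)); /* 1: initial attempt plus 1 retry done, stop */
    return 0;
}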
@@ -1467,7 +1535,7 @@ static int cb_ecs_filter(const void *data, size_t bytes,
     if (check == FLB_TRUE) {
         flb_plg_debug(ctx->ins, "Failed to get ECS Metadata for tag %s %d times. "
                       "Will not attempt to retry the metadata request. Will attach cluster metadata only.",
-                      tag, FLB_ECS_FILTER_METADATA_RETRIES);
+                      tag, ctx->agent_endpoint_retries);
     }

     if (check == FLB_FALSE && ctx->cluster_metadata_only == FLB_FALSE) {
If tag metadata retrieval does not fail, it would be nice for the retries counter to reset back to 0 or get deleted.
Currently it looks like total failures are being tracked, where we only want to count consecutive failures.
Then again, maybe on success you will never try to get the metadata again.
But there is a TTL on the hash table, so I think there will be more requests after a successful retrieval.
Yea on success we will add the container ID and its metadata to the hash table. Container IDs are static for the lifetime of a container.
/* this test is mainly for leak checking on error, not for checking result record */
expected.expected_records = 1; /* 1 record */
expected.expected_pattern = "cluster"; /* cluster key should be added */
It appears cluster name and task name come from different metadata calls and can fail independently.
confirmed
yup
@@ -308,6 +403,10 @@ static void flb_test_ecs_filter_cluster_error()
     sleep(2);
     TEST_CHECK(expected.actual_records == expected.expected_records);
     filter_test_destroy(ctx);
+
+    /* unset env */
+    setenv("FLB_ECS_PLUGIN_UNDER_TEST", "", 1);
Should a fail pattern be added here to ensure the cluster name is not present?
Not essential
msgpack_pack_str_body(&tmp_pck,
                      keypair->key,
                      len);
len = flb_sds_len(keypair->val);
will value always exist? I noticed that in the deletion code you check if value exists before deleting it.
See the other comment; I suppose it does always exist. The code is fine, though I'm not sure why in some places val == NULL is checked before use and in other places it is not.
In this case, these are pre-processed metadata keypairs, so they are guaranteed to have non-null values.
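For reference, the msgpack-c calls in the snippet above write a length header and then the raw bytes for each string. A minimal self-contained sketch of packing one pre-translated key/value pair (variable names here are illustrative, not the plugin's):

#include <msgpack.h>
#include <string.h>

/* Pack one key/value string pair, mirroring the
 * msgpack_pack_str / msgpack_pack_str_body pattern used in ecs.c. */
static void pack_keypair(msgpack_packer *pck, const char *key, const char *val)
{
    size_t len = strlen(key);
    msgpack_pack_str(pck, len);           /* string header: length */
    msgpack_pack_str_body(pck, key, len); /* string body: raw bytes */

    len = strlen(val);                    /* val assumed non-NULL, as stated above */
    msgpack_pack_str(pck, len);
    msgpack_pack_str_body(pck, val, len);
}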
msgpack_sbuffer_destroy(&tmp_sbuf);
return FLB_FILTER_NOTOUCH;
if (metadata_buffer->keypairs_len > 0) {
    mk_list_foreach_safe(head, tmp, &metadata_buffer->metadata_keypairs) {
Great optimization, avoiding unnecessary record accessor use.
if (flb_sds_len(val) == 0) {
    flb_plg_info(ctx->ins, "Translation failed for %s : %s for container ID: %s",
                 metadata_key->key, metadata_key->template, meta->id);
    flb_sds_destroy(val);
    /* keep trying other keys */
    continue;
}
I suppose value will never be null or length 0
(this was to address the other comment asking if we need to check some val==null condition as we do in other places)
Uhh, it's NULL if the record accessor translation fails.
I'm not following you here.
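To summarize the two cases being talked past each other here: the record accessor result can be NULL (the translation itself failed) or non-NULL but zero-length (the template keys were absent from the metadata). A standalone sketch with a stub in place of flb_ra_translate (the stub and names are illustrative):

#include <stdio.h>
#include <string.h>

/* Stub standing in for flb_ra_translate(): NULL means the translation
 * failed outright, "" means the template keys were not found. */
static const char *translate_stub(int mode)
{
    if (mode == 0) {
        return NULL;
    }
    if (mode == 1) {
        return "";
    }
    return "my-cluster";
}

int main(void)
{
    int mode;

    for (mode = 0; mode < 3; mode++) {
        const char *val = translate_stub(mode);
        if (val == NULL) {
            puts("translation failed: skip this key"); /* NULL case */
            continue;
        }
        if (strlen(val) == 0) {
            puts("empty value: skip this key, keep trying other keys");
            continue;
        }
        printf("attach value: %s\n", val); /* non-empty: attach to record */
    }
    return 0;
}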
"Number of retries for failed metadata requests to ECS Agent Introspection " | ||
"endpoint. The most common cause of failed metadata requests is that the " | ||
"container the metadata request was made for is not part of an ECS Task. " | ||
"Check if you have non-task containers and docker dual logging enabled." |
This can also happen if the agent is updated: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-agent-update.html
Just a note. No action needed for this
plugins/filter_ecs/ecs.c
}
consider omitting blank line
See, I don't think this is needed, because we only need the request for a container ID to succeed once. Container IDs never change during the container lifetime. Once we get the metadata, we have it and will not make another request. Of course there is the hash table TTL, but that applies to both the failure tracking and the stored metadata. So if we have, say, one failure for a tag, and then we get the metadata, later when the metadata entry expires, the failed tag hash entry will also already be expired.
Thanks, will fix it.
Yes. That one should only fail if the agent endpoint is not available. Right now the retries and failure tracking are mainly because you can have non-task containers, which will never have metadata and will always fail. I don't see a use case for stopping the filter from retrying on cluster metadata. If there is one in the future, we can add tracking for it.
Signed-off-by: Wesley Pettit <[email protected]>
fluent#6614 1.9 based branch Signed-off-by: Wesley Pettit <[email protected]>
Great! Thank you for your follow-up comments, they make sense. Approving.
fluent#6614 1.9 based branch Signed-off-by: Wesley Pettit <[email protected]>
This is rebased on top of: #6589
ECS Metadata: empty metadata fix and performance optimization
Problem Statement
The ECS Filter was first released in AWS for Fluent Bit 2.29.0. The first version works, but has a non-ideal experience when metadata lookups for a log fail, and the code is very inefficient.
Problem 1: Filter can attach metadata keypairs with empty values
When customers configure the ECS filter, they provide a list of metadata key-value pairs they want added to logs. The filter looks up metadata from the ECS Agent introspection endpoint.
However, the metadata fetch can fail for one of two reasons:
1. The ECS Agent introspection endpoint can be temporarily unavailable, for example while the agent is restarting or being updated, so the metadata request errors out.
2. If customers collect logs from /var/lib/docker/containers, they will get logs for all containers running on that host. This is due in part to the recent Docker dual logging feature, which will always write a container's stdout/stderr to /var/lib/docker/containers. While customers can disable dual logging in the Docker daemon.json config file, it is enabled by default on the ECS optimized AMI and in all vanilla Docker installs. Thus, Fluent Bit would collect logs for the ECS Agent and any other containers that are not part of an ECS Task. When the ECS Filter tries to look up these containers in the ECS Agent introspection endpoint, it will get an error response.
When either of the above issues occurs, the currently released ECS Filter will attach metadata values that are empty. For example, consider the following config.
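For illustration, a configuration along these lines (the key names and template variables here are examples, not necessarily the original config):

[FILTER]
    Name   ecs
    Match  *
    ADD    cluster      $ClusterName
    ADD    ecs_task_arn $TaskARN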
If the above errors occur, a customer would get a log like:
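Illustratively (not the original sample), the configured metadata keys would carry empty string values:

{"log": "hello world", "cluster": "", "ecs_task_arn": ""}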
Problem 2: Current ECS Filter is inefficient
Let N be the number of metadata key-value pairs the customer has configured. Currently, in the filter callback, the ECS Filter calls flb_ra_translate N times for every single log record it processes: https://github.com/fluent/fluent-bit/blob/v2.0.6/plugins/filter_ecs/ecs.c#L1533

Each call requires going through the metadata msgpack object and at least one flb_sds string allocation.

Root Cause of Problems
As explained above, the root cause is the inefficient code contributed in the first version of the filter: flb_ra_translate is called once per metadata key for every log record in the filter callback: https://github.com/fluent/fluent-bit/blob/v2.0.6/plugins/filter_ecs/ecs.c#L1533

Solution
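From the diff fragments above, the approach is to translate each metadata template once per container, cache only the non-empty results as key/value pairs, and have the per-record filter callback simply pack the cached pairs. A rough standalone sketch of that shape (all names are illustrative, not the plugin's actual structures):

#include <stdio.h>
#include <string.h>

#define MAX_KEYPAIRS 16

/* Cached, pre-translated metadata for one container. */
struct keypair {
    const char *key;
    const char *val;
};

struct metadata_buffer {
    struct keypair pairs[MAX_KEYPAIRS];
    int len;
};

/* Translate templates once per container, keeping only non-empty results. */
static void build_cache(struct metadata_buffer *buf,
                        const char *keys[], const char *vals[], int n)
{
    int i;

    buf->len = 0;
    for (i = 0; i < n && buf->len < MAX_KEYPAIRS; i++) {
        if (vals[i] == NULL || vals[i][0] == '\0') {
            continue; /* the fix: never cache or attach empty values */
        }
        buf->pairs[buf->len].key = keys[i];
        buf->pairs[buf->len].val = vals[i];
        buf->len++;
    }
}

/* Per-record work is now just emitting cached pairs; no per-record,
 * per-key template translation. */
static void attach(const struct metadata_buffer *buf)
{
    int i;

    for (i = 0; i < buf->len; i++) {
        printf("%s=%s\n", buf->pairs[i].key, buf->pairs[i].val);
    }
}

int main(void)
{
    const char *keys[] = { "cluster", "ecs_task_arn" };
    const char *vals[] = { "my-cluster", "" }; /* "" simulates a failed lookup */
    struct metadata_buffer buf;

    build_cache(&buf, keys, vals, 2);
    attach(&buf); /* prints only cluster=my-cluster */
    return 0;
}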
Estimates
Enter [N/A] in the box if an item is not applicable to your change.

Testing
Before we can approve your change, please submit the following in a comment:
If this is a change to packaging of containers or native binaries then please confirm it works for all targets.
Documentation
Backporting
Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.