Recommended Cloudwatch_Logs Configuration #340

Open
matthewfala opened this issue May 3, 2022 · 2 comments
Labels
guidance Customer is seeking guidance from us/the community

Comments

@matthewfala

matthewfala commented May 3, 2022

NOTICE: please see the main tracking ticket for multiple recently reported high impact issues in AWS for Fluent Bit: #542

Recommended Cloudwatch_Logs Configuration

Recently our team has received many inquiries about tuning the cloudwatch_logs output plugin via its configuration.

Customers using poorly tuned cloudwatch_logs configurations may experience log loss due to:

  • Broken connection / network errors
  • Lack of retries on batch failures
  • Lack of immediate network retries on network failure

These issues can be resolved via appropriate configuration.

If you are configuring FireLens via a Fluent Bit config file, use the following cloudwatch_logs configuration:

[OUTPUT]
    # general cloudwatch_logs configuration (nothing special here, customize to fit your use case)
    Name                cloudwatch_logs
    Match               ApplicationLogs
    region              ${LOG_REGION}
    log_group_name      ${SERVICE_NAME}-ApplicationLogs
    log_stream_prefix   ApplicationLogs--${HOSTNAME}
    auto_create_group   On

    # if you want to only write the log string without container metadata fields
    log_key             log

    # from aws-for-fluent-bit v2.32.0 onward, to support higher-throughput logging,
    # raise workers to a higher value such as 5 or the number of cores on your host
    workers             1

    # optimized cloudwatch_logs output configuration
    # delayed retries on error 
    retry_limit         5    
    # on is default
    net.keepalive On
    # CW uses 6s idle timeout, FLB has 1.5s timer to check conns.
    # 4s ensures FLB always closes the conn itself, which we found 
    # significantly reduces the rate of network error messages it outputs
    net.keepalive_idle_timeout 4s

If you are configuring FireLens via task definition logDriver configuration options:

"logConfiguration": {
	"logDriver":"awsfirelens",
	"options": {

// general cloudwatch_logs configuration (nothing special here, customize to fit your use case)
		"Name": "cloudwatch_logs",
		"region": "${LOG_REGION}",
		"log_group_name": "${SERVICE_NAME}-ApplicationLogs",
		"log_stream_prefix": "ApplicationLogs--${HOSTNAME}",
		"auto_create_group": "On",
		"log_key": "log",

// optimized cloudwatch_logs output configuration
		"workers": "1",
		"auto_retry_requests": "On",
		"retry_limit": "5"
	}
}
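
The task definition example above does not carry over the keepalive settings from the config file example. A minimal sketch of how they might be added is shown here, assuming FireLens passes each key in "options" through unchanged into the generated [OUTPUT] section (an assumption, not something stated in this issue). The placeholder values are the same as in the examples above; only the last two keys are new.

```json
{
  "logConfiguration": {
    "logDriver": "awsfirelens",
    "options": {
      "Name": "cloudwatch_logs",
      "region": "${LOG_REGION}",
      "log_group_name": "${SERVICE_NAME}-ApplicationLogs",
      "log_stream_prefix": "ApplicationLogs--${HOSTNAME}",
      "auto_create_group": "On",
      "log_key": "log",
      "retry_limit": "5",
      "net.keepalive": "On",
      "net.keepalive_idle_timeout": "4s"
    }
  }
}
```

The 4s value follows the reasoning in the config file comments: a 4s idle timeout plus the 1.5s connection check interval stays below the 6s CloudWatch idle timeout, so Fluent Bit should close idle connections before CloudWatch does.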

We may update the above configuration from time to time to reflect the cloudwatch_logs configuration that provides the best performance.

@PettitWesley

These settings used to be in the example but have been removed, since they match the defaults as of the Fluent Bit 1.9.x upstream version series:

    # create a separate thread for each cloudwatch_output (does not work with more than one worker per log stream due to cloudwatch_logs API concurrency limitations)
    # as of Fluent Bit 1.9, 1 worker is the default
    workers             1   
    # retry network requests immediately on failure
    # this setting also defaults to "On" in the 1.9 series. 
    auto_retry_requests On  

@Duplo-Yashwant

Are dual outputs supported under logConfiguration? For example, sending logs to CloudWatch as well as OpenSearch.

github-merge-queue bot pushed a commit to linz/topo-workflows that referenced this issue Apr 3, 2024
#### Motivation

Fluent Bit is experiencing a lot of network errors connecting to
`logs.ap-southeast-2.amazonaws.com`. This volume of errors increases
the log storage cost, see
#374.
This is a known issue for which the Fluent Bit team made [some
recommendations to reduce
it](aws/aws-for-fluent-bit#340). This PR
applies one of these recommendations and has been tested successfully
on non-prod.

#### Modification

- Remove [the patch](#374)
that stops sending Fluent Bit application logs to CloudWatch
- Set the Fluent Bit `keepalive idle timeout` to 4s (default is 1.5s)
following [the recommendations made
here](aws/aws-for-fluent-bit#340).

#### Checklist

- [ ] Tests updated - N/A
- [x] Docs updated
- [x] Issue linked in Title

---------

Co-authored-by: Victor Engmark <[email protected]>