fix: reduce Fluent Bit network errors TDE-1016 (#378)
#### Motivation

Fluent Bit is experiencing a lot of network errors when connecting to
`logs.ap-southeast-2.amazonaws.com`. This volume of errors increases
the log storage cost, see
#374.
This is a known issue for which the Fluent Bit team made [some
recommendations to reduce
it](aws/aws-for-fluent-bit#340). This PR
applies one of these recommendations and has been tested successfully
in non-prod.

#### Modification

- Remove [the patch](#374)
that stops sending Fluent Bit application logs to CloudWatch
- Set the Fluent Bit keepalive idle timeout (`net.keepalive_idle_timeout`) to 4s (default is 1.5s)
following [the recommendations made
here](aws/aws-for-fluent-bit#340).
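
As a minimal sketch of the second modification, the values passed to the chart could be built as below. The helper `cloudWatchLogsValues` and the interface `CloudWatchLogsValues` are hypothetical names used only for illustration and are not part of this commit. The sketch also assumes that the chart appends `extraOutputs` verbatim to the CloudWatch Logs output section.

```typescript
// Hypothetical shape of the CloudWatch Logs values used in infra/charts/fluentbit.ts.
interface CloudWatchLogsValues {
  logGroupName: string;
  logGroupTemplate: string;
  logStreamPrefix: string;
  extraOutputs: string;
}

// Hypothetical helper (not in this commit): builds the values with the
// keepalive idle timeout set, defaulting to the 4 seconds used in this PR.
function cloudWatchLogsValues(clusterName: string, idleTimeoutSeconds: number = 4): CloudWatchLogsValues {
  return {
    logGroupName: `/aws/eks/${clusterName}/logs`,
    // `$kubernetes[...]` is left as literal text here (it is not a `${...}`
    // placeholder), so it is passed through unchanged.
    logGroupTemplate: `/aws/eks/${clusterName}/workload/$kubernetes['namespace_name']`,
    logStreamPrefix: 'fb-',
    // Assumption: this line ends up as an extra setting in the output section.
    extraOutputs: `net.keepalive_idle_timeout ${idleTimeoutSeconds}s`,
  };
}
```

Keeping the timeout as a parameter would make it easier to try other values from the linked recommendations without editing the template string.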

#### Checklist

- [ ] Tests updated - N/A
- [x] Docs updated
- [x] Issue linked in Title

---------

Co-authored-by: Victor Engmark <[email protected]>
paulfouquet and l0b0 authored Apr 3, 2024
1 parent 9df31e0 commit 2d97fe4
Showing 2 changed files with 11 additions and 6 deletions.
6 changes: 5 additions & 1 deletion docs/infrastructure/components/fluentbit.md
@@ -46,7 +46,11 @@ The Fluent Bit application version is stored in `appVersion` but this is only he

## Troubleshooting

[Guide to Debugging Fluent Bit issues](https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md)
### Resources

- [Guide to Debugging Fluent Bit issues](https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md)
- [2023 High Impact Issues Notice/Catalogue Ticket](https://github.com/aws/aws-for-fluent-bit/issues/542)
- [Recommended Cloudwatch_Logs Configuration](https://github.com/aws/aws-for-fluent-bit/issues/340)

### Basic checks

11 changes: 6 additions & 5 deletions infra/charts/fluentbit.ts
@@ -81,6 +81,12 @@ HC_Period 5
logGroupName: `/aws/eks/${props.clusterName}/logs`,
logGroupTemplate: `/aws/eks/${props.clusterName}/workload/$kubernetes['namespace_name']`,
logStreamPrefix: 'fb-',
/**
* Set the Fluent Bit idle timeout to 4 seconds.
* This helps reduce the rate of network errors in the logs.
* See: https://github.com/aws/aws-for-fluent-bit/issues/340
*/
extraOutputs: `net.keepalive_idle_timeout 4s`,
},
firehose: { enabled: false },
kinesis: { enabled: false },
@@ -93,11 +99,6 @@ HC_Period 5
{ key: 'karpenter.sh/capacity-type', operator: 'Equal', value: 'spot', effect: 'NoSchedule' },
{ key: 'kubernetes.io/arch', operator: 'Equal', value: 'arm64', effect: 'NoSchedule' },
],
/* To reduce the log volume being sent to CloudWatch (shipped to AWS s3 => storage cost),
tells Fluent Bit to not send the logs from the Fluent Bit application pods.
The Fluent Bit application pods have some (a lot!) network errors that are being logged.
*/
annotations: { 'fluentbit.io/exclude': 'true' },
},
});
}