---
description: Send logs, data, and metrics to Amazon S3
---
The Amazon S3 output plugin lets you ingest records into the S3 cloud object store.
The plugin can upload data to S3 using the multipart upload API or PutObject.
Multipart is the default and is recommended. Fluent Bit will stream data in a series
of parts. This limits the amount of data buffered on disk at any point in time.
By default, a new part is uploaded every time 5 MiB of data has been received.
The plugin can create files up to gigabytes in size from many small chunks or parts
using the multipart API. All aspects of the upload process are configurable.
The plugin lets you specify a maximum file size, and a timeout for uploads. A file will be created in S3 when the maximum size or the timeout is reached, whichever comes first.
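For example, a minimal sketch (the bucket name and region are placeholders) that creates a file in S3 when 100 MB of data accumulates or 15 minutes pass, whichever comes first:

```
[OUTPUT]
    Name            s3
    Match           *
    bucket          example-bucket
    region          us-east-1
    total_file_size 100M
    upload_timeout  15m
```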
Records are stored in files in S3 as newline delimited JSON.
See AWS Credentials for details about fetching AWS credentials.
{% hint style="info" %} The Prometheus success/retry/error metrics values output by the built-in HTTP server in Fluent Bit are meaningless for S3 output. S3 has its own buffering and retry mechanisms. The Fluent Bit AWS S3 maintainers apologize for this feature gap; you can track issue progress on GitHub. {% endhint %}
| Key | Description | Default |
|---|---|---|
| `region` | The AWS region of your S3 bucket. | `us-east-1` |
| `bucket` | S3 bucket name. | none |
| `json_date_key` | Specify the time key name in the output record. To disable the time key, set the value to `false`. | `date` |
| `json_date_format` | Specify the format of the date. Accepted values: `double`, `epoch`, `iso8601` (`2018-05-30T09:39:52.000681Z`), `java_sql_timestamp` (`2018-05-30 09:39:52.000681`). | `iso8601` |
| `total_file_size` | Specify file size in S3. Minimum size is `1M`. With `use_put_object On` the maximum size is `1G`. With multipart uploads, the maximum size is `50G`. | `100M` |
| `upload_chunk_size` | The size of each part for multipart uploads. Max: `50M`. | `5,242,880` bytes |
| `upload_timeout` | When this amount of time elapses, Fluent Bit uploads and creates a new file in S3. Set to `60m` to upload a new file every hour. | `10m` |
| `store_dir` | Directory to locally buffer data before sending. When using multipart uploads, data buffers until reaching the `upload_chunk_size`. S3 stores metadata about in-progress multipart uploads in this directory, allowing pending uploads to be completed if Fluent Bit stops and restarts. It also stores the current `$INDEX` value if enabled in the S3 key format, so the `$INDEX` keeps incrementing from its previous value after Fluent Bit restarts. | `/tmp/fluent-bit/s3` |
| `store_dir_limit_size` | Size limit for disk usage of the S3 buffer files in the `store_dir`. Use `store_dir_limit_size` instead of `storage.total_limit_size`, which can be used for other plugins. | `0` (unlimited) |
| `s3_key_format` | Format string for keys in S3. This option supports a UUID, `strftime` time formatters, and a syntax for selecting parts of the Fluent log tag inspired by the `rewrite_tag` filter. Add `$UUID` in the format string to insert a random string. Add `$INDEX` in the format string to insert an integer that increments on each upload; the `$INDEX` value is saved in the `store_dir`. Add `$TAG` in the format string to insert the full log tag. Add `$TAG[0]` to insert the first part of the tag in the S3 key. The tag is split into parts using the characters specified with the `s3_key_format_tag_delimiters` option. Add the extension directly after the last piece of the format string to insert a key suffix. To specify a key suffix in `use_put_object` mode, you must specify `$UUID`. See S3 Key Format. The time in the S3 key is the timestamp of the first record in the S3 file. | `/fluent-bit-logs/$TAG/%Y/%m/%d/%H/%M/%S` |
| `s3_key_format_tag_delimiters` | A series of characters used to split the tag into parts for use with `s3_key_format`. | `.` |
| `static_file_path` | Disables the behavior where a UUID string is appended to the end of the S3 key name when `$UUID` isn't provided in `s3_key_format`. `$UUID`, time formatters, `$TAG`, and other dynamic key formatters all work as expected while this feature is set to `true`. | `false` |
| `use_put_object` | Use the S3 `PutObject` API instead of the multipart upload API. When enabled, the key extension is only available when `$UUID` is specified in `s3_key_format`. If `$UUID` isn't included, a random string is appended to the format string and the key extension can't be customized. | `false` |
| `role_arn` | ARN of an IAM role to assume (for example, for cross-account access). | none |
| `endpoint` | Custom endpoint for the S3 API. Endpoints can contain scheme and port. | none |
| `sts_endpoint` | Custom endpoint for the STS API. | none |
| `profile` | Option to specify an AWS profile for credentials. | `default` |
| `canned_acl` | Predefined canned ACL policy for S3 objects. | none |
| `compression` | Compression type for S3 objects. `gzip` is currently the only value supported by default. If Apache Arrow support was enabled at compile time, you can also use `arrow`. For `gzip` compression, the `Content-Encoding` HTTP header is set to `gzip`. `gzip` compression can be enabled when `use_put_object` is on or off (`PutObject` and multipart). `arrow` compression can only be enabled with `use_put_object On`. | none |
| `content_type` | A standard MIME type for the S3 object, set as the `Content-Type` HTTP header. | none |
| `send_content_md5` | Send the `Content-MD5` header with `PutObject` and `UploadPart` requests, as is required when Object Lock is enabled. | `false` |
| `auto_retry_requests` | Immediately retry failed requests to AWS services once. This option doesn't affect the normal Fluent Bit retry mechanism with backoff. Instead, it enables an immediate retry with no delay for networking errors, which can help improve throughput during transient network issues. | `true` |
| `log_key` | By default, the whole log record is sent to S3. When specifying a key name with this option, only the value of that key is sent to S3. For example, when using Docker you can specify `log_key log` and only the log message is sent to S3. | none |
| `preserve_data_ordering` | When an upload request fails, the last received chunk might swap with a later chunk, resulting in data shuffling. This feature prevents shuffling by using queue logic for uploads. | `true` |
| `storage_class` | Specify the storage class for S3 objects. If this option isn't specified, objects are stored with the default `STANDARD` storage class. | none |
| `retry_limit` | Integer value to set the maximum number of retries allowed. Requires versions 1.9.10 and 2.0.1 or later. For previous versions, the number of retries is 5 and isn't configurable. | `1` |
| `external_id` | Specify an external ID for the STS API. Can be used with the `role_arn` parameter if your role requires an external ID. | none |
| `workers` | The number of workers to perform flush operations for this output. | `1` |
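As an illustration of the `log_key` option described in the table above, the following hedged sketch (bucket and region are placeholders) ships only the `log` field of each record to S3:

```
[OUTPUT]
    Name    s3
    Match   *
    bucket  example-bucket
    region  us-east-1
    log_key log
```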
To skip TLS verification, set `tls.verify` to `false`. For more details about the available properties and general configuration, refer to TLS/SSL.
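For example, a hedged sketch that disables certificate verification for a self-hosted, S3-compatible endpoint (the endpoint URL and bucket are placeholders):

```
[OUTPUT]
    Name       s3
    Match      *
    bucket     example-bucket
    endpoint   https://s3.example.internal:9000
    tls.verify Off
```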
The plugin requires the following AWS IAM permissions:
```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "s3:PutObject"
    ],
    "Resource": "*"
  }]
}
```
The S3 output plugin is designed to upload large files to an Amazon S3 bucket, unlike most other outputs, which send many requests to upload data in batches of a few megabytes or less.
When Fluent Bit receives logs, it stores them in chunks, either in memory or the filesystem depending on your settings. Chunks are usually around 2 MB in size. Fluent Bit sends chunks, in order, to each output that matches their tag. Most outputs then send the chunk immediately to their destination. A chunk is sent to the output's `flush` callback function, which must return one of `FLB_OK`, `FLB_RETRY`, or `FLB_ERROR`. Fluent Bit keeps count of the return values from each output's `flush` callback function. These counters are the data source for Fluent Bit error, retry, and success metrics available in Prometheus format through its monitoring interface.
The S3 output plugin conforms to the Fluent Bit output plugin specification. Since S3's use case is to upload large files (over 2 MB), its behavior is different. S3's `flush` callback function buffers the incoming chunk to the filesystem, and returns an `FLB_OK`. This means Prometheus metrics available from the Fluent Bit HTTP server are meaningless for S3. In addition, the `storage.total_limit_size` parameter isn't meaningful for S3 since it has its own buffering system in the `store_dir`. Instead, use `store_dir_limit_size`. S3 requires a writeable filesystem. Running Fluent Bit on a read-only filesystem won't work with the S3 output. S3 uploads primarily initiate using the S3 `timer` callback function, which runs separately from its `flush`.
S3 has its own buffering system and its own callback to upload data, so the normal sequential data ordering of chunks provided by the Fluent Bit engine can be compromised. S3 has the `preserve_data_ordering` option, which ensures data is uploaded in the original order it was collected by Fluent Bit.
- The HTTP monitoring interface output metrics aren't meaningful for S3. AWS understands that this is non-ideal. See the open issue and design to allow S3 to manage its own output metrics.
- You must use `store_dir_limit_size` to limit the space on disk used by S3 buffer files (see the sketch after this list).
- The original ordering of data inputted to Fluent Bit might not be preserved unless you enable `preserve_data_ordering On`.
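A hedged configuration sketch combining both options (all values are placeholders, not recommendations):

```
[OUTPUT]
    Name                   s3
    Match                  *
    bucket                 example-bucket
    region                 us-east-1
    store_dir              /var/fluent-bit/s3
    store_dir_limit_size   500M
    preserve_data_ordering On
```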
In Fluent Bit, all logs have an associated tag. The `s3_key_format` option lets you inject the tag into the S3 key using the following syntax:

- `$TAG`: The full tag.
- `$TAG[n]`: The nth part of the tag (index starting at zero). This syntax is copied from the `rewrite_tag` filter. By default, tag parts are separated with dots, but you can change this with `s3_key_format_tag_delimiters`.
In the following example, assume the date is January 1st, 2020 00:00:00 and the tag associated with the logs in question is `my_app_name-logs.prod`.
```
[OUTPUT]
    Name                         s3
    Match                        *
    bucket                       my-bucket
    region                       us-west-2
    total_file_size              250M
    s3_key_format                /$TAG[2]/$TAG[0]/%Y/%m/%d/%H/%M/%S/$UUID.gz
    s3_key_format_tag_delimiters .-
```
With the delimiters as `.` and `-`, the tag splits into parts as follows:

- `$TAG[0]` = `my_app_name`
- `$TAG[1]` = `logs`
- `$TAG[2]` = `prod`

The key in S3 will be `/prod/my_app_name/2020/01/01/00/00/00/bgdHN1NM.gz`.
The Fluent Bit S3 output was designed to ensure that previous uploads will never be overwritten by a subsequent upload. The `s3_key_format` supports time formatters, `$UUID`, and `$INDEX`. `$INDEX` is special because it's saved in the `store_dir`. If you restart Fluent Bit with the same disk, it can continue incrementing the index from its last value in the previous run.
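For example, a hedged sketch (bucket, region, and key prefix are placeholders) that numbers uploads with `$INDEX` so keys stay unique across restarts:

```
[OUTPUT]
    Name          s3
    Match         *
    bucket        example-bucket
    region        us-east-1
    store_dir     /var/fluent-bit/s3
    s3_key_format /example-logs/$TAG/%Y/%m/%d/$INDEX.gz
```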
For files uploaded with the `PutObject` API, the S3 output requires that a unique random string be present in the S3 key. Many of the use cases for `PutObject` uploads involve a short time period between uploads, so a timestamp in the S3 key might not be unique enough between uploads. For example, if you only specify minute-granularity timestamps in the S3 key, with a small upload size, it's possible to have two uploads with timestamps in the same minute. This requirement can be disabled with `static_file_path On`.
The `PutObject` API is used in these cases:

- When you explicitly set `use_put_object On`.
- On startup, when the S3 output finds old buffer files in the `store_dir` from a previous run and attempts to send all of them at once.
- On shutdown, to prevent data loss, the S3 output attempts to send all currently buffered data at once.
You should always specify `$UUID` somewhere in your S3 key format. Otherwise, if the `PutObject` API is used, S3 appends a random eight-character UUID to the end of your S3 key. This means that a file extension set at the end of an S3 key will have the random UUID appended to it. Disable this behavior with `static_file_path On`.
This example attempts to set a `.gz` extension without specifying `$UUID`:
```
[OUTPUT]
    Name            s3
    Match           *
    bucket          my-bucket
    region          us-west-2
    total_file_size 50M
    use_put_object  Off
    compression     gzip
    s3_key_format   /$TAG/%Y/%m/%d/%H_%M_%S.gz
```
In the case where pending data is uploaded on shutdown, if the tag was `app`, the S3 key in the S3 bucket might be:

```
/app/2022/12/25/00_00_00.gz-apwgylqg
```
The S3 output appended a random string to the file extension, since this upload on shutdown used the `PutObject` API.
There are two ways of disabling this behavior:

- Use `static_file_path`:

  ```
  [OUTPUT]
      Name             s3
      Match            *
      bucket           my-bucket
      region           us-west-2
      total_file_size  50M
      use_put_object   Off
      compression      gzip
      s3_key_format    /$TAG/%Y/%m/%d/%H_%M_%S.gz
      static_file_path On
  ```

- Explicitly define where the random UUID will go in the S3 key format:

  ```
  [OUTPUT]
      Name            s3
      Match           *
      bucket          my-bucket
      region          us-west-2
      total_file_size 50M
      use_put_object  Off
      compression     gzip
      s3_key_format   /$TAG/%Y/%m/%d/%H_%M_%S/$UUID.gz
  ```
The `store_dir` is used to temporarily store data before upload. If Fluent Bit stops suddenly, it will try to send all data and complete all uploads before it shuts down. If it can't send some data, on restart it will look in the `store_dir` for existing data and try to send it.

Multipart uploads are ideal for most use cases because they allow the plugin to upload data in small chunks over time. For example, a 1 GB file can be created from 200 5 MB chunks. While the file size in S3 will be 1 GB, only 5 MB will be buffered on disk at any one point in time.
One drawback to multipart uploads is that the file and data aren't visible in S3 until the upload is completed with a `CompleteMultipartUpload` call. The plugin attempts to make this call whenever Fluent Bit is shut down to ensure your data is available in S3. It also stores metadata about each upload in the `store_dir`, ensuring that uploads can be completed when Fluent Bit restarts (assuming it has access to persistent disk and the `store_dir` files will still be present on restart).
If you run Fluent Bit in an environment without persistent disk, or without the ability to restart Fluent Bit and give it access to the data stored in the `store_dir` from previous executions, some considerations apply. This might occur if you run Fluent Bit on AWS Fargate.

In these situations, it's recommended to use the `PutObject` API and send data frequently, to avoid local buffering as much as possible. This will limit data loss in the event Fluent Bit is killed unexpectedly.
The following settings are recommended for this use case:
```
[OUTPUT]
    Name            s3
    Match           *
    bucket          your-bucket
    region          us-east-1
    total_file_size 1M
    upload_timeout  1m
    use_put_object  On
```
With `use_put_object Off` (the default), S3 will attempt to send files using multipart uploads. For each file, S3 first calls `CreateMultipartUpload`, then a series of calls to `UploadPart` for each fragment (targeted to be `upload_chunk_size` bytes), and finally `CompleteMultipartUpload` to create the final file in S3. S3 requires each `UploadPart` fragment to be at least 5,242,880 bytes, otherwise the upload is rejected.
The S3 output must sometimes fall back to the `PutObject` API.

Uploads are triggered by these settings:
- `total_file_size` and `upload_chunk_size`: When S3 has buffered data in the `store_dir` that meets the desired `total_file_size` (for `use_put_object On`) or the `upload_chunk_size` (for multipart), it will trigger an upload operation.
- `upload_timeout`: Whenever locally buffered data has been present on the filesystem in the `store_dir` longer than the configured `upload_timeout`, it will be sent even when the desired byte size hasn't been reached (see the sketch after this list). If you configure a small `upload_timeout`, your files can be smaller than the `total_file_size`. The timeout is evaluated against the time at which S3 started buffering data for each unique tag (that is, the time when new data was buffered for the unique tag after the last upload). The timeout is also evaluated against the `CreateMultipartUpload` time, so a multipart upload will be completed after `upload_timeout` has elapsed, even if the desired size hasn't yet been reached.
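For instance, a hedged sketch (bucket and region are placeholders) that prioritizes freshness over file size, uploading every 5 minutes even if `total_file_size` hasn't been reached:

```
[OUTPUT]
    Name            s3
    Match           *
    bucket          example-bucket
    region          us-east-1
    total_file_size 100M
    upload_timeout  5m
```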
If your `upload_timeout` triggers an upload before the pending buffered data reaches the `upload_chunk_size`, it might be too small for a multipart upload. S3 will fall back to using the `PutObject` API.
When you enable compression, S3 applies the compression algorithm at send time. The size settings trigger uploads based on the size of buffered data, not the final compressed size. It's possible that after compression, buffered data no longer meets the required minimum S3 UploadPart size. If this occurs, you will see a log message like:
```text
[ info] [output:s3:s3.0] Pre-compression upload_chunk_size= 5630650, After compression, chunk is only 1063320 bytes, the chunk was too small, using PutObject to upload
```
If you encounter this frequently, use the numbers in the messages to estimate your compression factor. In this example, the buffered data was reduced from 5,630,650 bytes to 1,063,320 bytes, so the compressed size is roughly one-fifth of the original. Configuring `upload_chunk_size 30M` should ensure each part is large enough after compression to be over the minimum required part size of 5,242,880 bytes.
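A hedged sketch of that adjustment (bucket and region are placeholders; the `30M` value assumes the roughly 5x compression ratio observed above):

```
[OUTPUT]
    Name              s3
    Match             *
    bucket            example-bucket
    region            us-east-1
    compression       gzip
    upload_chunk_size 30M
```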
The S3 API allows the last part in an upload to be less than the 5,242,880 byte minimum. If a part is too small for an existing upload, the S3 output will upload that part and then complete the upload.
The `upload_timeout` is evaluated against the `CreateMultipartUpload` time. A multipart upload will be completed after `upload_timeout` elapses, even if the desired size hasn't yet been reached.
When `CreateMultipartUpload` is called, an `UploadID` is returned. S3 stores these IDs for active uploads in the `store_dir`. Until `CompleteMultipartUpload` is called, the uploaded data isn't visible in S3.
On shutdown, the S3 output attempts to complete all pending uploads. If an upload fails to complete, the ID remains buffered in the `store_dir` in a directory called `multipart_upload_metadata`. If you restart the S3 output with the same `store_dir`, it will discover the old `UploadID`s and complete the pending uploads. The S3 documentation has suggestions on discovering and deleting or completing dangling uploads in your buckets.
MinIO is a high-performance, S3-compatible object store that lets you build applications with S3 functionality without using S3 itself.

The following example assumes a MinIO server running at `localhost:9000` with a bucket named `your-bucket`.
Example:

```
[OUTPUT]
    Name     s3
    Match    *
    bucket   your-bucket
    endpoint http://localhost:9000
```
The records are stored in the MinIO server.

You can send your S3 output to Google Cloud Storage. You must generate HMAC keys on GCS and use those keys in place of your AWS access key and secret key (for example, via the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables).
Example:

```
[OUTPUT]
    Name     s3
    Match    *
    bucket   your-bucket
    endpoint https://storage.googleapis.com
```
To send records into Amazon S3, you can run the plugin from the command line or through the configuration file.
The S3 plugin reads parameters from the command line through the `-p` argument:

```shell
fluent-bit -i cpu -o s3 -p bucket=my-bucket -p region=us-west-2 -m '*' -f 1
```
In your main configuration file, append the following `[OUTPUT]` section:
```
[OUTPUT]
    Name            s3
    Match           *
    bucket          your-bucket
    region          us-east-1
    store_dir       /home/ec2-user/buffer
    total_file_size 50M
    upload_timeout  10m
```
An example using `PutObject` instead of multipart:

```
[OUTPUT]
    Name            s3
    Match           *
    bucket          your-bucket
    region          us-east-1
    store_dir       /home/ec2-user/buffer
    use_put_object  On
    total_file_size 10M
    upload_timeout  10m
```
Amazon distributes a container image with Fluent Bit and these plugins: github.com/aws/aws-for-fluent-bit. Images are available in the Amazon ECR Public Gallery as `aws-for-fluent-bit`.
You can download images with different tags using the following command:

```shell
docker pull public.ecr.aws/aws-observability/aws-for-fluent-bit:<tag>
```
For example, you can pull the image with the latest version with:

```shell
docker pull public.ecr.aws/aws-observability/aws-for-fluent-bit:latest
```
If you see errors for image pull limits, try signing in to public ECR with your AWS credentials:

```shell
aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws
```
See the Amazon ECR Public official documentation for more details.
`amazon/aws-for-fluent-bit` is also available from Docker Hub.
Use Fluent Bit SSM Public Parameters to find the Amazon ECR image URI in your region:

```shell
aws ssm get-parameters-by-path --path /aws/service/aws-for-fluent-bit/
```
For more information, see the AWS for Fluent Bit GitHub repository.
With Fluent Bit v1.8 or greater, the Amazon S3 plugin includes support for Apache Arrow. Support isn't enabled by default, and has a dependency on a shared version of `libarrow`.
To use this feature, `FLB_ARROW` must be turned on at compile time. Use the following commands:

```shell
cd build/
cmake -DFLB_ARROW=On ..
cmake --build .
```
Once compiled, Fluent Bit can upload incoming data to S3 in Apache Arrow format. For example:
```
[INPUT]
    Name cpu

[OUTPUT]
    Name            s3
    Bucket          your-bucket-name
    total_file_size 1M
    use_put_object  On
    upload_timeout  60s
    Compression     arrow
```
Setting `Compression` to `arrow` makes Fluent Bit convert the payload into Apache Arrow format.
You can load, analyze, and process stored data using popular data processing tools such as Python pandas, Apache Spark, and TensorFlow.
The following example uses `pyarrow` to analyze the uploaded data:
```python
>>> import pyarrow.feather as feather
>>> import pyarrow.fs as fs
>>>
>>> s3 = fs.S3FileSystem()
>>> file = s3.open_input_file("my-bucket/fluent-bit-logs/cpu.0/2021/04/27/09/36/15-object969o67ZF")
>>> df = feather.read_feather(file)
>>> print(df.head())
                          date  cpu_p  user_p  system_p  cpu0.p_cpu  cpu0.p_user  cpu0.p_system
0  2021-04-27T09:33:53.539346Z    1.0     1.0       0.0         1.0          1.0            0.0
1  2021-04-27T09:33:54.539330Z    0.0     0.0       0.0         0.0          0.0            0.0
2  2021-04-27T09:33:55.539305Z    1.0     0.0       1.0         1.0          0.0            1.0
3  2021-04-27T09:33:56.539430Z    0.0     0.0       0.0         0.0          0.0            0.0
4  2021-04-27T09:33:57.539803Z    0.0     0.0       0.0         0.0          0.0            0.0
```