out_s3: add documentation for Apache Arrow support
This is a new feature of upcoming Fluent Bit v1.8 release.

 * The Amazon S3 plugin can now store incoming data in Apache
   Arrow format.

 * This support is convenient for data analysis and manipulation.

   e.g. Use Fluent Bit to send real-time system statistics to S3
        and do time-series analysis using pandas.

 * For now, this requires a feature flag to be turned on at
   compile time; it is not enabled in a default build.

Add documentation about the support, and explain how to make use
of the feature.

Signed-off-by: Fujimoto Seiji <[email protected]>
fujimotos committed Dec 16, 2021
1 parent 51c90bb commit 12185c6
Showing 1 changed file with 50 additions and 1 deletion.
51 changes: 50 additions & 1 deletion pipeline/outputs/s3.md
Expand Up @@ -35,7 +35,7 @@ See [here](https://github.com/fluent/fluent-bit-docs/tree/43c4fe134611da471e706b
| endpoint | Custom endpoint for the S3 API. | None |
| sts_endpoint | Custom endpoint for the STS API. | None |
| canned_acl | [Predefined Canned ACL policy](https://docs.aws.amazon.com/AmazonS3/latest/dev/acl-overview.html#canned-acl) for S3 objects. | None |
| compression | Compression type for S3 objects. 'gzip' is currently the only supported value. The Content-Encoding HTTP Header will be set to 'gzip'. Compression can be enabled when `use_put_object` is on. | None |
| compression | Compression type for S3 objects. 'gzip' is currently the only supported value. The Content-Encoding HTTP Header will be set to 'gzip'. Compression can be enabled when `use_put_object` is on. If Apache Arrow support is enabled at compile time, this option can also be set to 'arrow'. | None |
| content_type | A standard MIME type for the S3 object; this will be set as the Content-Type HTTP header. This option can be enabled when `use_put_object` is on. | None |
| send_content_md5 | Send the Content-MD5 header with PutObject and UploadPart requests, as is required when Object Lock is enabled. | false |
| auto_retry_requests | Immediately retry failed requests to AWS services once. This option does not affect the normal Fluent Bit retry mechanism with backoff. Instead, it enables an immediate retry with no delay for networking errors, which may help improve throughput when there are transient/random networking issues. | false |
Expand Down Expand Up @@ -205,3 +205,52 @@ aws ssm get-parameters-by-path --path /aws/service/aws-for-fluent-bit/
```

For more see [the AWS for Fluent Bit github repo](https://github.com/aws/aws-for-fluent-bit#public-images).

## Advanced usage

### Use Apache Arrow for in-memory data processing

Starting from Fluent Bit v1.8, the Amazon S3 plugin includes support for [Apache Arrow](https://arrow.apache.org/). This support is currently not enabled by default, since it depends on a shared build of `libarrow` as a prerequisite.

To use this feature, `FLB_ARROW` must be turned on at compile time:

```text
$ cd build/
$ cmake -DFLB_ARROW=On ..
$ cmake --build .
```

Once compiled, Fluent Bit can upload incoming data to S3 in Apache Arrow format. For example:

```text
[INPUT]
Name cpu
[OUTPUT]
Name s3
Bucket your-bucket-name
total_file_size 1M
use_put_object On
upload_timeout 60s
Compression arrow
```

As shown in this example, setting `Compression` to `arrow` makes Fluent Bit convert its payload into the Apache Arrow format.

The stored data is easy to load, analyze and process using popular data-processing tools such as pandas, Apache Spark and TensorFlow. The following code uses `pyarrow` to analyze the uploaded data:

```text
>>> import pyarrow.feather as feather
>>> import pyarrow.fs as fs
>>>
>>> s3 = fs.S3FileSystem()
>>> file = s3.open_input_file("my-bucket/fluent-bit-logs/cpu.0/2021/04/27/09/36/15-object969o67ZF")
>>> df = feather.read_feather(file)
>>> print(df.head())
date cpu_p user_p system_p cpu0.p_cpu cpu0.p_user cpu0.p_system
0 2021-04-27T09:33:53.539346Z 1.0 1.0 0.0 1.0 1.0 0.0
1 2021-04-27T09:33:54.539330Z 0.0 0.0 0.0 0.0 0.0 0.0
2 2021-04-27T09:33:55.539305Z 1.0 0.0 1.0 1.0 0.0 1.0
3 2021-04-27T09:33:56.539430Z 0.0 0.0 0.0 0.0 0.0 0.0
4 2021-04-27T09:33:57.539803Z 0.0 0.0 0.0 0.0 0.0 0.0
```
