out_s3: add documentation for Apache Arrow support #523

Merged
merged 1 commit into from
Jan 12, 2022
51 changes: 50 additions & 1 deletion pipeline/outputs/s3.md
@@ -35,7 +35,7 @@ See [here](https://github.com/fluent/fluent-bit-docs/tree/43c4fe134611da471e706b
| endpoint | Custom endpoint for the S3 API. | None |
| sts_endpoint | Custom endpoint for the STS API. | None |
| canned_acl | [Predefined Canned ACL policy](https://docs.aws.amazon.com/AmazonS3/latest/dev/acl-overview.html#canned-acl) for S3 objects. | None |
| compression | Compression type for S3 objects. 'gzip' is currently the only supported value. The Content-Encoding HTTP Header will be set to 'gzip'. Compression can be enabled when `use_put_object` is on. | None |
| compression | Compression type for S3 objects. 'gzip' is the only value supported by default; when it is used, the Content-Encoding HTTP header is set to 'gzip'. Compression can be enabled when `use_put_object` is on. If Apache Arrow support was enabled at compile time, this option can also be set to 'arrow'. | None |
| content_type | A standard MIME type for the S3 object; this will be set as the Content-Type HTTP header. This option can be enabled when `use_put_object` is on. | None |
| send_content_md5 | Send the Content-MD5 header with PutObject and UploadPart requests, as is required when Object Lock is enabled. | false |
| auto_retry_requests | Immediately retry failed requests to AWS services once. This option does not affect the normal Fluent Bit retry mechanism with backoff. Instead, it enables an immediate retry with no delay for networking errors, which may help improve throughput when there are transient/random networking issues. | false |
@@ -205,3 +205,52 @@ aws ssm get-parameters-by-path --path /aws/service/aws-for-fluent-bit/
```

For more see [the AWS for Fluent Bit github repo](https://github.com/aws/aws-for-fluent-bit#public-images).

## Advanced usage

### Use Apache Arrow for in-memory data processing

Starting from Fluent Bit v1.8, the Amazon S3 plugin includes support for [Apache Arrow](https://arrow.apache.org/). This support is currently not enabled by default, because it requires a shared build of `libarrow` to be installed as a prerequisite.

To use this feature, `FLB_ARROW` must be turned on at compile time:

```text
$ cd build/
$ cmake -DFLB_ARROW=On ..
$ cmake --build .
```

Once compiled, Fluent Bit can upload incoming data to S3 in Apache Arrow format. For example:

```text
[INPUT]
    Name cpu

[OUTPUT]
    Name s3
    Bucket your-bucket-name
    total_file_size 1M
    use_put_object On
    upload_timeout 60s
    Compression arrow
```

As shown in this example, setting `Compression` to `arrow` makes Fluent Bit convert the payload into Apache Arrow format.

The stored data can be loaded, analyzed, and processed with popular data processing tools (such as Python pandas, Apache Spark and TensorFlow). The following code uses `pyarrow` to analyze the uploaded data:

```text
>>> import pyarrow.feather as feather
>>> import pyarrow.fs as fs
>>>
>>> s3 = fs.S3FileSystem()
>>> file = s3.open_input_file("my-bucket/fluent-bit-logs/cpu.0/2021/04/27/09/36/15-object969o67ZF")
>>> df = feather.read_feather(file)
>>> print(df.head())
                          date  cpu_p  user_p  system_p  cpu0.p_cpu  cpu0.p_user  cpu0.p_system
0  2021-04-27T09:33:53.539346Z    1.0     1.0       0.0         1.0          1.0            0.0
1  2021-04-27T09:33:54.539330Z    0.0     0.0       0.0         0.0          0.0            0.0
2  2021-04-27T09:33:55.539305Z    1.0     0.0       1.0         1.0          0.0            1.0
3  2021-04-27T09:33:56.539430Z    0.0     0.0       0.0         0.0          0.0            0.0
4  2021-04-27T09:33:57.539803Z    0.0     0.0       0.0         0.0          0.0            0.0
```
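
When choosing which objects to load, it can help to recover the tag and upload time from the object path. The following is a minimal sketch using only the Python standard library. It assumes the path layout seen in the example above (`.../<tag>/<year>/<month>/<day>/<hour>/<minute>/<second>-<suffix>`) and that the tag contains no `/`. If you set a custom `s3_key_format`, the layout will differ. The `parse_object_key` helper is hypothetical and is not part of Fluent Bit or `pyarrow`.

```python
from datetime import datetime


def parse_object_key(path):
    """Split an S3 object path into (tag, timestamp, suffix).

    Assumes the layout seen in the example path above:
    .../<tag>/<YYYY>/<MM>/<DD>/<HH>/<MM>/<SS>-<suffix>
    """
    parts = path.strip("/").split("/")
    if len(parts) < 7:
        raise ValueError(f"unexpected object path: {path!r}")

    # Read from the right so that any bucket name or key prefix is ignored.
    tag = parts[-7]
    year, month, day, hour, minute = (int(p) for p in parts[-6:-1])
    # The last component carries the seconds, then "-" and a suffix.
    second, _, suffix = parts[-1].partition("-")
    timestamp = datetime(year, month, day, hour, minute, int(second))
    return tag, timestamp, suffix


tag, timestamp, suffix = parse_object_key(
    "my-bucket/fluent-bit-logs/cpu.0/2021/04/27/09/36/15-object969o67ZF"
)
print(tag, timestamp.isoformat(), suffix)
```

The returned timestamp is a naive `datetime`, because the sketch makes no assumption about the time zone used when the key was generated.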