From 12185c6a8bab9a4690c9877f0a0aa897391f5ebe Mon Sep 17 00:00:00 2001
From: Fujimoto Seiji
Date: Thu, 16 Dec 2021 10:32:13 +0900
Subject: [PATCH] out_s3: add documentation for Apache Arrow support

This is a new feature of the upcoming Fluent Bit v1.8 release.

* The Amazon S3 plugin can now store incoming data in Apache Arrow
  format.

* This support is convenient for data analysis and manipulation,
  e.g. using Fluent Bit to send real-time system statistics to S3
  and performing time-series analysis with pandas.

* For now, it requires a feature flag to be turned on at compile
  time. It is not enabled in a default build.

Add documentation about the support, and explain how to make use of
the feature.

Signed-off-by: Fujimoto Seiji
---
 pipeline/outputs/s3.md | 51 +++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 50 insertions(+), 1 deletion(-)

diff --git a/pipeline/outputs/s3.md b/pipeline/outputs/s3.md
index 2cdf77c25..8616ed197 100644
--- a/pipeline/outputs/s3.md
+++ b/pipeline/outputs/s3.md
@@ -35,7 +35,7 @@ See [here](https://github.com/fluent/fluent-bit-docs/tree/43c4fe134611da471e706b
 | endpoint | Custom endpoint for the S3 API. | None |
 | sts_endpoint | Custom endpoint for the STS API. | None |
 | canned_acl | [Predefined Canned ACL policy](https://docs.aws.amazon.com/AmazonS3/latest/dev/acl-overview.html#canned-acl) for S3 objects. | None |
-| compression | Compression type for S3 objects. 'gzip' is currently the only supported value. The Content-Encoding HTTP Header will be set to 'gzip'. Compression can be enabled when `use_put_object` is on. | None |
+| compression | Compression type for S3 objects. 'gzip' is currently the only supported value. The Content-Encoding HTTP Header will be set to 'gzip'. Compression can be enabled when `use_put_object` is on. If Apache Arrow support was enabled at compile time, this option can also be set to 'arrow'. | None |
 | content_type | A standard MIME type for the S3 object; this will be set as the Content-Type HTTP header. This option can be enabled when `use_put_object` is on. | None |
 | send_content_md5 | Send the Content-MD5 header with PutObject and UploadPart requests, as is required when Object Lock is enabled. | false |
 | auto_retry_requests | Immediately retry failed requests to AWS services once. This option does not affect the normal Fluent Bit retry mechanism with backoff. Instead, it enables an immediate retry with no delay for networking errors, which may help improve throughput when there are transient/random networking issues. | false |
@@ -205,3 +205,52 @@ aws ssm get-parameters-by-path --path /aws/service/aws-for-fluent-bit/
 ```
 
 For more see [the AWS for Fluent Bit github repo](https://github.com/aws/aws-for-fluent-bit#public-images).
+
+## Advanced usage
+
+### Use Apache Arrow for in-memory data processing
+
+Starting from Fluent Bit v1.8, the Amazon S3 plugin includes support for [Apache Arrow](https://arrow.apache.org/). The support is currently not enabled by default, because it requires a shared build of `libarrow` as a prerequisite.
+
+To use this feature, `FLB_ARROW` must be turned on at compile time:
+
+```text
+$ cd build/
+$ cmake -DFLB_ARROW=On ..
+$ cmake --build .
+```
+
+Once compiled, Fluent Bit can upload incoming data to S3 in Apache Arrow format. For example:
+
+```text
+[INPUT]
+    Name cpu
+
+[OUTPUT]
+    Name s3
+    Bucket your-bucket-name
+    total_file_size 1M
+    use_put_object On
+    upload_timeout 60s
+    Compression arrow
+```
+
+As shown in this example, setting `Compression` to `arrow` makes Fluent Bit convert the payload into Apache Arrow format.
+
+The stored data is easy to load, analyze and process using popular data-processing tools such as pandas, Apache Spark and TensorFlow. The following code uses `pyarrow` to analyze the uploaded data:
+
+```text
+>>> import pyarrow.feather as feather
+>>> import pyarrow.fs as fs
+>>>
+>>> s3 = fs.S3FileSystem()
+>>> file = s3.open_input_file("my-bucket/fluent-bit-logs/cpu.0/2021/04/27/09/36/15-object969o67ZF")
+>>> df = feather.read_feather(file)
+>>> print(df.head())
+                          date  cpu_p  user_p  system_p  cpu0.p_cpu  cpu0.p_user  cpu0.p_system
+0  2021-04-27T09:33:53.539346Z    1.0     1.0       0.0         1.0          1.0            0.0
+1  2021-04-27T09:33:54.539330Z    0.0     0.0       0.0         0.0          0.0            0.0
+2  2021-04-27T09:33:55.539305Z    1.0     0.0       1.0         1.0          0.0            1.0
+3  2021-04-27T09:33:56.539430Z    0.0     0.0       0.0         0.0          0.0            0.0
+4  2021-04-27T09:33:57.539803Z    0.0     0.0       0.0         0.0          0.0            0.0
+```
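
---

Note for reviewers: the commit message cites time-series analysis with pandas as the motivating use case. A minimal sketch of what that could look like after the pyarrow step above, using a hypothetical inline sample that mimics the `cpu` input's schema (in practice the DataFrame would be the one returned by `feather.read_feather()`):

```python
# Sketch: per-minute aggregation of CPU metrics with pandas.
# The records below are hypothetical sample data standing in for the
# DataFrame loaded from S3 via pyarrow in the example above.
import pandas as pd

df = pd.DataFrame({
    "date": [
        "2021-04-27T09:33:53.539346Z",
        "2021-04-27T09:33:54.539330Z",
        "2021-04-27T09:33:55.539305Z",
        "2021-04-27T09:33:56.539430Z",
        "2021-04-27T09:34:57.539803Z",
    ],
    "cpu_p": [1.0, 0.0, 1.0, 0.0, 2.0],
})

# Parse the ISO-8601 timestamps and use them as a time index.
df["date"] = pd.to_datetime(df["date"])
df = df.set_index("date")

# Average the total CPU usage over one-minute windows.
per_minute = df["cpu_p"].resample("1min").mean()
print(per_minute)
```

The same resampling works unchanged on the real Fluent Bit data, since the `cpu` input emits one record per interval with a `date` column.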