out_s3: add Apache Arrow support #3184

Closed

fujimotos
Member

Apache Arrow is an efficient columnar data format that is suitable
for statistical analysis, and popular in the machine learning community.

With this patch merged, users can now specify 'arrow' as the
compression type like this:

[OUTPUT]
  Name s3
  Bucket some-bucket
  total_file_size 1M
  use_put_object On
  compression arrow

which makes Fluent Bit convert the request buffer into Apache Arrow
format before uploading.

Signed-off-by: Fujimoto Seiji [email protected]
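
To illustrate what converting a row-oriented buffer into a columnar layout means, here is a minimal standard-library sketch. It assumes the request buffer holds newline-delimited JSON records, which this thread does not confirm. The function name `rows_to_columns` is hypothetical, and the real patch relies on the Arrow libraries rather than plain dictionaries.

```python
import json


def rows_to_columns(ndjson_text):
    """Pivot newline-delimited JSON records into a dict of column lists.

    Hypothetical helper for illustration only; it is not part of Fluent Bit.
    """
    records = [json.loads(line) for line in ndjson_text.splitlines() if line.strip()]
    # Collect every key in first-seen order so sparse records still line up.
    names = []
    for record in records:
        for key in record:
            if key not in names:
                names.append(key)
    # A missing value becomes None, playing the role of a null in a column.
    return {name: [record.get(name) for record in records] for name in names}


# Two sample records shaped like the cpu input's output (values from this thread).
buffer = (
    '{"date": "2021-03-08T09:03:03.668251Z", "cpu_p": 0.0, "user_p": 0.0}\n'
    '{"date": "2021-03-08T09:03:04.668156Z", "cpu_p": 1.0, "user_p": 1.0}\n'
)
columns = rows_to_columns(buffer)
print(columns["cpu_p"])  # [0.0, 1.0]
```

Each key maps to one contiguous list of values, which is the property that makes columnar formats convenient for per-field statistics.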

@fujimotos
Member Author

Arrow support is disabled by default because not every server has the required
dependency libarrow. To enable the support, we need to compile it as follows:

$ cmake .. -DFLB_ARROW=On
$ cmake --build .

Here is an example that shows how one can utilize Apache Arrow support.

Configuration

[INPUT]
  Name cpu 

[OUTPUT]
  Name s3
  Match *
  Region ap-northeast-1
  Bucket fluent-bit-20210308
  total_file_size 1M
  use_put_object On
  upload_timeout 1m
  Compression arrow

Result

Now the uploaded data can be loaded instantly via Arrow's S3 interface.

https://arrow.apache.org/docs/python/filesystems.html

For example, the above configuration produces clean tabular time-series
data like this:

>>> import pyarrow as pa
>>> table = load_data_from_s3()
>>> print(table)
                           date  cpu_p  user_p  system_p  cpu0.p_cpu  cpu0.p_user  cpu0.p_system
0   2021-03-08T09:03:03.668251Z    0.0     0.0       0.0         0.0          0.0            0.0 
1   2021-03-08T09:03:04.668156Z    1.0     1.0       0.0         1.0          1.0            0.0 
2   2021-03-08T09:03:05.668242Z    0.0     0.0       0.0         0.0          0.0            0.0 
3   2021-03-08T09:03:06.668269Z    0.0     0.0       0.0         0.0          0.0            0.0 
4   2021-03-08T09:03:07.668218Z    0.0     0.0       0.0         0.0          0.0            0.0 
5   2021-03-08T09:03:08.739886Z    2.0     1.0       1.0         2.0          1.0            1.0 
6   2021-03-08T09:03:09.668181Z    1.0     1.0       0.0         1.0          1.0            0.0 
7   2021-03-08T09:03:10.668247Z    1.0     0.0       1.0         1.0          0.0            1.0 
8   2021-03-08T09:03:11.668182Z    2.0     2.0       0.0         2.0          2.0            0.0 
9   2021-03-08T09:03:12.668275Z    1.0     0.0       1.0         1.0          0.0            1.0 
10  2021-03-08T09:03:13.668428Z    0.0     0.0       0.0         0.0          0.0            0.0 
11  2021-03-08T09:03:14.668320Z    2.0     2.0       0.0         2.0          2.0            0.0 
12  2021-03-08T09:03:15.668256Z    0.0     0.0       0.0         0.0          0.0            0.0 
13  2021-03-08T09:03:16.668287Z    0.0     0.0       0.0         0.0          0.0            0.0 
14  2021-03-08T09:03:17.668307Z    1.0     1.0       0.0         1.0          1.0            0.0 
15  2021-03-08T09:03:18.668257Z    0.0     0.0       0.0         0.0          0.0            0.0 
16  2021-03-08T09:03:19.668281Z    0.0     0.0       0.0         0.0          0.0            0.0 
17  2021-03-08T09:03:20.668317Z    0.0     0.0       0.0         0.0          0.0            0.0 
18  2021-03-08T09:03:21.668231Z    0.0     0.0       0.0         0.0          0.0            0.0 
19  2021-03-08T09:03:22.668222Z    0.0     0.0       0.0         0.0          0.0            0.0 

@fujimotos fujimotos requested a review from PettitWesley March 8, 2021 10:00
@PettitWesley
Contributor

CC @zhonghui12

@agup006
Member

agup006 commented Mar 10, 2021

I'm wondering if we can just add the build requirements to the build server; it seems like that could help with adoption of this feature.

I'm thinking of how the tensorflow filter is rarely used because a user needs to build with specific settings on in order to enable it.

@PettitWesley
Contributor

+1 to @agup006's comment/question: will this be included by default in the upstream distro/build?

PettitWesley
PettitWesley previously approved these changes Mar 10, 2021
Contributor

@PettitWesley PettitWesley left a comment

Aside from the question of gating this behind a build flag that defaults to false, the code LGTM

@fujimotos
Member Author

> I'm wondering if we can just add the build requirements to the build server, seems like it could help with this feature adoption.
> I'm thinking of how the tensorflow filter is rarely used because a user needs to build with specific settings on in order to enable it.

@kou Do you have any comments on this point?

The basic problem here is that enabling Arrow support makes the Fluent Bit executable
depend on libarrow, but I don't think most servers have the library installed.

@kou

kou commented Mar 10, 2021

If we enable Apache Arrow support by default, we should use Apache Arrow C++ directly instead of going through Apache Arrow GLib. With Apache Arrow GLib, we need libarrow, libarrow-glib and libglib.
(Using Apache Arrow C++ directly also avoids the performance overhead of Apache Arrow GLib.)

As a fallback for users who don't have libarrow installed, we can use a libarrow.a built with ExternalProject_Add, passing -DARROW_BUILD_STATIC=ON in CMAKE_ARGS.

@fujimotos
Member Author

@edsiper I updated this patch accordingly.

BTW, I discussed enabling Apache Arrow support by default with @kou.
It is difficult as of Mar 2021 because libarrow is implemented in C++, and we
cannot easily link it statically into Fluent Bit (which is written in C).

There are plans to create a C-friendly library within the Apache Arrow project.
Until that statically linkable library is available, the current implementation is
the best we can do.

@fujimotos fujimotos force-pushed the sf/s3-arrow-stg branch 3 times, most recently from 7a2c4db to 50443bb Compare March 24, 2021 04:11
@fujimotos fujimotos changed the title out_s3: Add Apache Arrow support out_s3: add Apache Arrow support Mar 24, 2021
@fujimotos
Member Author

@kou I applied your feedback. Please approve this PR if you are fine with it.

kou
kou previously approved these changes Mar 24, 2021
@github-actions
Contributor

This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@PettitWesley
Contributor

@fujimotos what is needed to get this feature finished?

@fujimotos
Member Author

@PettitWesley Sorry. I've been a bit absent from Fluent Bit recently,
working on the Fluentd side of things (I'm also a maintainer of that project).

I'm gonna find some time today to finish this PR! So WFM.

@fujimotos fujimotos dismissed stale reviews from kou and PettitWesley via fdb7147 April 27, 2021 07:47
@fujimotos
Member Author

> Could you also replace ctx->compression != NULL with ctx->compression != 0 here?

@kou Thank you! I have fixed that.

kou
kou previously approved these changes Apr 27, 2021

@kou kou left a comment

+1

@fujimotos
Member Author

@PettitWesley I believe this PR is mergeable now.

@fujimotos
Member Author

fluent/fluent-bit-docs/pull/523 is the documentation patch for the feature.

@PettitWesley
Contributor

@fujimotos Awesome!

@edsiper You requested changes; can you approve/re-review so we can get this merged?

@PettitWesley PettitWesley requested a review from edsiper April 27, 2021 19:37
@github-actions github-actions bot added the Stale label May 28, 2021
@fujimotos fujimotos removed the Stale label May 28, 2021
@github-actions github-actions bot added the Stale label Jun 28, 2021
@fujimotos fujimotos removed the Stale label Jul 6, 2021
@PettitWesley
Contributor

@edsiper @fujimotos What do we need to get this merged?

@edsiper
Member

edsiper commented Jul 19, 2021

@PettitWesley

  • get rid of g_print() function (replace by flb_plg...())
  • fix conflicts
  • @PettitWesley approve the changes once listed items above are fixed

Apache Arrow is an efficient columnar data format that is suitable
for statistical analysis, and popular in the machine learning community.

    https://arrow.apache.org/

With this patch merged, users can now specify 'arrow' as the
compression type like this:

    [OUTPUT]
      Name s3
      Bucket some-bucket
      total_file_size 1M
      use_put_object On
      Compression arrow

which makes Fluent Bit convert the request buffer into Apache Arrow
format before uploading.

Signed-off-by: Fujimoto Seiji <[email protected]>
@fujimotos
Member Author

fujimotos commented Jul 19, 2021

@PettitWesley @edsiper Sorry for the delay. I submitted an update, e210a45.

  • get rid of g_print() function

I removed the g_print() calls from my implementation.

  • fix conflicts

It can now be merged cleanly into master.

  • @PettitWesley approve the changes once listed items above are fixed

If it looks okay to @PettitWesley, let's merge this branch.

@fujimotos
Member Author

fujimotos commented Jul 19, 2021

ADDENDUM: I can confirm e210a45 works fine with the latest version of Apache Arrow (4.0),
so this patch should be good in terms of features.

Screenshot of actual data (S3/Arrow)
[Image: Screenshot_2021-07-20_08-42-05]

Contributor

@PettitWesley PettitWesley left a comment

@fujimotos Thanks!

@fujimotos
Member Author

I merged this PR into mainline via 544fa89.

Thanks to everyone who reviewed and helped with this PR. Closing it now.
