Improve Uproot transformer serialization #131

Closed
kyungeonchoi opened this issue May 11, 2020 · 2 comments

@kyungeonchoi
Collaborator

Story

The current Uproot transformer serializes data in the following steps:

  1. Awkward Table returned from generated_transformer.py
  2. Convert Awkward Table to Pandas DataFrame (DF)
  3. Convert Pandas DF to Arrow
  4. Save Arrow as Parquet or stream to Kafka

Remove the unnecessary step(s) and implement the new method suggested by Jim Pivarski to improve performance.

Acceptance Criteria

  1. The Uproot transformer serializes data faster than the current implementation
@kyungeonchoi kyungeonchoi added the enhancement New feature or request label May 11, 2020
@kyungeonchoi
Collaborator Author

Awkward supports direct conversion from awkward.Table to Arrow (awkward.toarrow), but the layout of the awkward.Table returned by generated_transformer.py is not compatible with this function. To be converted to Arrow, the table has to consist of top-level ChunkedArrays.

Two solutions:

  1. Immediate: convert the awkward.Table to awkward1.Table and then back to awkward.Table. The round trip fixes the layout issue of the original awkward.Table, so awkward.toarrow should then work. This requires awkward1 >= 0.2.19.
  2. Near future: Uproot will produce awkward1.Table and awkward1 will produce Arrow (toarrow and fromarrow #68 scikit-hep/awkward#224)

Quick benchmark of Awkward Table to Arrow serialization for a 9904-row x 20-column Awkward table (tested at UC River):

  • Current: 7.95 sec
  • Immediate solution: 0.01 sec

Significant improvement is expected for large tables.
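
A comparison like this could be timed with a small harness along the following lines. This is a sketch, not the script used for the numbers above: `best_time`, `slow_path`, and `fast_path` are hypothetical names, and the two workloads are placeholders to be swapped for the real serialization paths.

```python
import time


def best_time(func, *args, repeat=3):
    """Return the fastest wall-clock time in seconds over `repeat` calls."""
    timings = []
    for _ in range(repeat):
        start = time.perf_counter()
        func(*args)
        timings.append(time.perf_counter() - start)
    return min(timings)


def slow_path(rows):
    # Placeholder workload: builds one Python dict per row.
    return [{"x": value} for value in rows]


def fast_path(rows):
    # Placeholder workload: copies the whole column in one call.
    return list(rows)


rows = range(9904)  # same row count as the table in the benchmark
elapsed_slow = best_time(slow_path, rows)
elapsed_fast = best_time(fast_path, rows)
print(f"slow: {elapsed_slow:.6f} s, fast: {elapsed_fast:.6f} s")
```

Taking the minimum over a few repeats reduces noise from other processes on a shared machine.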

@jpivarski

Although it's technically closed, work on scikit-hep/awkward#224 hasn't stopped. I've taken over for Anish and I'm touching up his work as scikit-hep/awkward#263. It should be ready pretty soon—these are the final dottings and crossings of i's and t's.
