Support writing nested frames to parquet files #83

Merged: 6 commits, May 16, 2024

@wilsonbb wilsonbb commented May 15, 2024

Change Description

Here we add a to_parquet method that serializes a NestedFrame to parquet in one of two ways: on a "per-layer" basis, where each layer is written as its own parquet file in a specified directory, or as a single file, where the nested layers remain embedded within the columns. read_parquet is adjusted to handle either case by no longer requiring the to_pack argument.
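
To make the two output modes concrete, here is a minimal sketch of the file layout each one implies. It does not call nested-pandas and writes nothing to disk. The helper `plan_parquet_outputs`, the layer names `lightcurve` and `spectra`, and the `base.parquet` / `<layer>.parquet` file names are all assumptions made for illustration, not names confirmed by this PR.

```python
from pathlib import Path


def plan_parquet_outputs(path, nested_layers, by_layer=False):
    """Return the parquet files a write would produce (hypothetical helper).

    This only models the layout described in the PR text; it writes nothing.
    """
    target = Path(path)
    if not by_layer:
        # Single-file mode: nested layers stay embedded in the columns
        # of one parquet file.
        return [target]
    # Per-layer mode: `path` is treated as a directory holding one file
    # for the base layer plus one file per nested layer.
    # The "base.parquet" and "<layer>.parquet" names are assumptions.
    return [target / "base.parquet"] + [
        target / f"{layer}.parquet" for layer in nested_layers
    ]


# Hypothetical nested layer names, chosen only for this example.
layers = ["lightcurve", "spectra"]

single = plan_parquet_outputs("out/frame.parquet", layers)
per_layer = plan_parquet_outputs("out/frame_dir", layers, by_layer=True)

print([p.as_posix() for p in single])
print([p.as_posix() for p in per_layer])
```

In this sketch the single-file mode yields one path, while the per-layer mode yields one path per layer under the given directory, which is why a reader would need to discover the layers from the directory contents rather than from an explicit argument.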

We also add a helper method in generation.py to write randomly generated data to parquet file(s).

Addresses #43

  • My PR includes a link to the issue that I am addressing

Solution Description

Code Quality

  • I have read the Contribution Guide
  • My code follows the code style of this project
  • My code builds (or compiles) cleanly without any errors or warnings
  • My code contains relevant comments and necessary documentation

New Feature Checklist

  • I have added or updated the docstrings associated with my feature using the NumPy docstring format
  • I have updated the tutorial to highlight my new feature (if appropriate)
  • I have added unit/End-to-End (E2E) test cases to cover my new feature
  • My change includes a breaking change
    • My change includes backwards compatibility and deprecation warnings (if possible)

github-actions bot commented May 15, 2024

Before [282bc9b]  After [24e3529]  Ratio  Benchmark (Parameter)
31.4±2ms          34.1±2ms         1.09   benchmarks.AssignSingleDfToNestedSeries.time_run
9.02±0.08ms       9.79±0.4ms       1.09   benchmarks.NestedFrameAddNested.time_run
59.6±4ms          62.8±2ms         1.05   benchmarks.ReassignHalfOfNestedSeries.time_run
5.16±0.03ms       5.34±0.2ms       1.03   benchmarks.NestedFrameReduce.time_run
6.39±0.07ms       6.54±0.2ms       1.02   benchmarks.NestedFrameQuery.time_run
255M              257M             1.01   benchmarks.AssignSingleDfToNestedSeries.peakmem_run
271M              275M             1.01   benchmarks.ReassignHalfOfNestedSeries.peakmem_run
86.4M             86.4M            1      benchmarks.NestedFrameAddNested.peakmem_run
89.5M             89.7M            1      benchmarks.NestedFrameQuery.peakmem_run
89.5M             89.7M            1      benchmarks.NestedFrameReduce.peakmem_run

codecov bot commented May 15, 2024

Codecov Report

Attention: Patch coverage is 88.88889% with 2 lines in your changes missing coverage. Please review.

Project coverage is 98.68%. Comparing base (282bc9b) to head (0439298).
Report is 73 commits behind head on main.

Files                                     Patch %  Lines
src/nested_pandas/datasets/generation.py  33.33%   2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #83      +/-   ##
==========================================
+ Coverage   98.65%   98.68%   +0.02%     
==========================================
  Files          15       15              
  Lines         818      836      +18     
==========================================
+ Hits          807      825      +18     
  Misses         11       11              

@wilsonbb wilsonbb marked this pull request as ready for review May 16, 2024 21:51
@wilsonbb wilsonbb requested a review from dougbrn May 16, 2024 21:51

@dougbrn dougbrn left a comment

Looks good. I just have one question: what does the output for by_layer with multiple partitions look like, and what should it look like?

src/nested_pandas/nestedframe/core.py (resolved)
@wilsonbb wilsonbb merged commit 3dea29f into main May 16, 2024
10 of 11 checks passed
@wilsonbb wilsonbb deleted the from_parquet branch May 16, 2024 22:23