Initial read_parquet MVP implementation #21

Merged
merged 4 commits into from
Apr 9, 2024

Collaborator

@dougbrn dougbrn commented Apr 8, 2024

Change Description

Solution Description

This PR implements a read_parquet method that reads a base file and a set of nested files, specified through a to_pack dictionary kwarg. While implementing it, I noticed some friction in having to specify kwargs for the nested frame loading in the same call as the base frame. For now, I focused only on column subsets, using a dictionary to assign a subset to each nested frame. This will get messier as we move to things like Dask and potentially need to handle partitioning kwargs as well.
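To make the friction concrete, here is a minimal standard-library sketch of the single-call shape. It is not the nested-pandas implementation: `load_rows`, `read_with_nested`, the `"base"` key, the `id_col` parameter, and the sample data are all hypothetical, and in-memory CSV text stands in for parquet files. The point is only that every per-file option has to travel through one dictionary keyed by frame name.

```python
import csv
import io

# Hypothetical sample data, standing in for files on disk.
OBJECTS = "id,ra,dec\n1,10.0,-5.0\n2,20.0,5.0\n"
DIA = "id,mjd,flux,flag\n1,59000,1.5,0\n1,59001,1.7,1\n2,59000,0.9,0\n"


def load_rows(source, columns=None):
    """Read CSV text into a list of dicts, optionally keeping a column subset."""
    rows = list(csv.DictReader(io.StringIO(source)))
    if columns is None:
        return rows
    return [{c: row[c] for c in columns} for row in rows]


def read_with_nested(base, to_pack, columns=None, id_col="id"):
    """Hypothetical single-call loader: base plus nested sources in one call.

    `columns` maps "base" or a nested name to the columns to keep. Any subset
    given for a nested source must include `id_col`, which is used for matching.
    """
    columns = columns or {}
    objects = load_rows(base, columns.get("base"))
    for name, source in to_pack.items():
        nested = load_rows(source, columns.get(name))
        for obj in objects:
            # Attach the matching nested rows, dropping the redundant id column.
            obj[name] = [
                {k: v for k, v in row.items() if k != id_col}
                for row in nested
                if row[id_col] == obj[id_col]
            ]
    return objects


result = read_with_nested(
    OBJECTS,
    to_pack={"dia": DIA},
    columns={"base": ["id", "ra"], "dia": ["id", "flux"]},
)
```

Adding a second per-file option (for example partitioning) in this shape would mean a second dictionary with the same keys, which is where the call signature starts to sprawl.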

I thought of an alternative, which I posted in the pie-in-the-sky doc. It may scale better to more complex loading tasks, though it may not be important for now:

# csv instead of parquet but same idea
df = read_csv("objects.csv", columns=some_subset)
df = df.pack_csv("dia_sources.csv", "dia", columns=dia_subset)
df = df.pack_csv("forced_sources.csv", "forced", columns=forced_subset)
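The chained idea above can be sketched with the standard library alone. This is an illustration under stated assumptions, not an existing nested-pandas API: `PackedFrame`, `read_csv_rows`, this `pack_csv` signature, the `id` join column, and the sample data are all hypothetical. Each call carries only the options for the one file it loads.

```python
import csv
import io

# Hypothetical sample data, standing in for files on disk.
OBJECTS = "id,ra,dec\n1,10.0,-5.0\n2,20.0,5.0\n"
DIA = "id,mjd,flux\n1,59000,1.5\n1,59001,1.7\n2,59000,0.9\n"
FORCED = "id,mjd,mag\n2,59002,21.3\n"


def read_csv_rows(source, columns=None):
    """Read CSV text into a list of dicts, optionally keeping a column subset."""
    rows = list(csv.DictReader(io.StringIO(source)))
    if columns is not None:
        rows = [{c: row[c] for c in columns} for row in rows]
    return rows


class PackedFrame:
    """Hypothetical stand-in for a nested frame: a list of row dicts."""

    def __init__(self, rows, id_col="id"):
        self.rows = rows
        self.id_col = id_col

    def pack_csv(self, source, name, columns=None):
        """Return a new frame with rows from `source` nested under `name`.

        If `columns` is given, it must include the id column used for matching.
        """
        groups = {}
        for row in read_csv_rows(source, columns):
            key = row.pop(self.id_col)  # the id is used for grouping only
            groups.setdefault(key, []).append(row)
        packed = [
            {**row, name: groups.get(row[self.id_col], [])} for row in self.rows
        ]
        return PackedFrame(packed, self.id_col)


df = PackedFrame(read_csv_rows(OBJECTS, columns=["id", "ra"]))
df = df.pack_csv(DIA, "dia", columns=["id", "flux"])
df = df.pack_csv(FORCED, "forced", columns=["id", "mag"])
```

Because each nested source gets its own call, a new loading option would only widen `pack_csv` rather than requiring another dictionary keyed by frame name.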

Code Quality

  • I have read the Contribution Guide
  • My code follows the code style of this project
  • My code builds (or compiles) cleanly without any errors or warnings
  • My code contains relevant comments and necessary documentation

Project-Specific Pull Request Checklists

Bug Fix Checklist

  • My fix includes a new test that breaks as a result of the bug (if possible)
  • My change includes a breaking change
    • My change includes backwards compatibility and deprecation warnings (if possible)

New Feature Checklist

  • I have added or updated the docstrings associated with my feature using the NumPy docstring format
  • I have updated the tutorial to highlight my new feature (if appropriate)
  • I have added unit/End-to-End (E2E) test cases to cover my new feature
  • My change includes a breaking change
    • My change includes backwards compatibility and deprecation warnings (if possible)

Documentation Change Checklist

Build/CI Change Checklist

  • If required or optional dependencies have changed (including version numbers), I have updated the README to reflect this
  • If this is a new CI setup, I have added the associated badge to the README

Other Change Checklist

  • Any new or updated docstrings use the NumPy docstring format.
  • I have updated the tutorial to highlight my new feature (if appropriate)
  • I have added unit/End-to-End (E2E) test cases to cover any changes
  • My change includes a breaking change
    • My change includes backwards compatibility and deprecation warnings (if possible)

github-actions bot commented Apr 8, 2024

Before [00be464]  After [a2794e0]  Ratio  Benchmark (Parameter)
1.18±1s           3.16±1s          ~2.67  benchmarks.time_computation
3.16k             1.71k            0.54   benchmarks.mem_list

codecov bot commented Apr 8, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.93%. Comparing base (00be464) to head (041e513).

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #21      +/-   ##
==========================================
+ Coverage   92.75%   92.93%   +0.17%     
==========================================
  Files          12       13       +1     
  Lines         552      566      +14     
==========================================
+ Hits          512      526      +14     
  Misses         40       40              

@dougbrn dougbrn changed the title from "WIP: Initial read_parquet MVP implementation" to "Initial read_parquet MVP implementation" Apr 9, 2024
@dougbrn dougbrn marked this pull request as ready for review April 9, 2024 18:57
@dougbrn dougbrn requested a review from hombit April 9, 2024 18:57
Collaborator

@hombit hombit left a comment

Great, thank you!

Review thread on src/nested_pandas/nestedframe/io.py (outdated, resolved)
Review thread on src/nested_pandas/nestedframe/io.py (resolved)
@dougbrn dougbrn merged commit 136f2c5 into main Apr 9, 2024
11 checks passed
@dougbrn dougbrn deleted the read_parquet branch April 9, 2024 19:47
Development

Successfully merging this pull request may close these issues.

MVP: Wrapping read_parquet
2 participants