Initial read_parquet MVP implementation #21

dougbrn · 2024-04-08T20:00:24Z

Change Description

My PR includes a link to the issue that I am addressing
Resolves MVP: Wrapping read_parquet #12.

Solution Description

This PR implements a read_parquet method that reads a base file and a set of nested files passed through a to_pack dictionary kwarg. While implementing this, I noticed some friction in having to specify kwargs for the nested frame loading in the same call as the base frame. For now, I focused on getting column subsets working, using a dictionary that assigns a column subset to each nested frame. This will get messier as we move to things like Dask and potentially need to handle partitioning kwargs as well.
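To make the friction concrete, here is a minimal sketch of the single-call design, where per-frame options travel through dictionaries keyed by nested name. This is not the code in this PR: `read_with_nested`, `pack_columns`, and the `id` join key are hypothetical names, in-memory CSV stands in for parquet files so the example needs no parquet engine, and nested rows are collected as plain lists of dicts rather than packed into a nested dtype.

```python
import io

import pandas as pd


def read_with_nested(base, to_pack, columns=None, pack_columns=None, key="id"):
    """Hypothetical single-call loader: one base table plus named nested tables.

    to_pack maps a nested name to its source; pack_columns maps the same
    names to an optional column subset for that nested table.
    """
    pack_columns = pack_columns or {}
    # Any column subset must include the join key so the frames can be aligned.
    base_df = pd.read_csv(base, usecols=columns).set_index(key)
    for name, source in to_pack.items():
        nested = pd.read_csv(source, usecols=pack_columns.get(name))
        # Collect the nested rows belonging to each base row.
        records = {
            k: g.drop(columns=key).to_dict("records")
            for k, g in nested.groupby(key)
        }
        base_df[name] = pd.Series(
            [records.get(k, []) for k in base_df.index],
            index=base_df.index,
            dtype=object,
        )
    return base_df


# Illustrative data only (made up for this sketch).
objects_csv = "id,ra,dec\n1,10.0,-5.0\n2,20.0,5.0\n"
dia_csv = "id,mjd,flux,band\n1,59000,1.5,g\n1,59001,1.7,r\n2,59000,0.9,g\n"

result = read_with_nested(
    io.StringIO(objects_csv),
    to_pack={"dia": io.StringIO(dia_csv)},
    columns=["id", "ra"],
    pack_columns={"dia": ["id", "mjd", "flux"]},
)
```

Every additional per-frame option (engine, partitioning, filters) would need another dictionary parameter alongside `pack_columns`, which is the scaling concern raised above.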

I thought of an alternative, which I posted in the pie-in-the-sky doc. It may scale better to more complex loading tasks, though it may not be important for now:

# csv instead of parquet but same idea
df = read_csv("objects.csv", columns=some_subset)
df = df.pack_csv("dia_sources.csv", "dia", columns=dia_subset)
df = df.pack_csv("forced_sources.csv", "forced", columns=forced_subset)
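A runnable approximation of the chained idea above, offered only as a sketch: `pack_csv` here is a hypothetical free function applied through `DataFrame.pipe` (pandas has no `pack_csv` method), the `id` join key is assumed, and nested rows are stored as lists of dicts instead of a real nested dtype. The sample data is made up.

```python
import io

import pandas as pd


def pack_csv(df, source, name, columns=None, key="id"):
    """Hypothetical helper: load one nested CSV and attach it as column `name`.

    Each call carries its own loading kwargs, so options for one nested
    table never share a signature with options for another.
    """
    nested = pd.read_csv(source, usecols=columns)
    records = {
        k: g.drop(columns=key).to_dict("records")
        for k, g in nested.groupby(key)
    }
    out = df.copy()
    out[name] = pd.Series(
        [records.get(k, []) for k in out.index], index=out.index, dtype=object
    )
    return out


# Illustrative data only.
objects_csv = "id,ra,dec\n1,10.0,-5.0\n2,20.0,5.0\n"
dia_csv = "id,mjd,flux\n1,59000,1.5\n1,59001,1.7\n"
forced_csv = "id,mjd,flux\n2,59002,0.4\n"

df = (
    pd.read_csv(io.StringIO(objects_csv), usecols=["id", "ra"])
    .set_index("id")
    .pipe(pack_csv, io.StringIO(dia_csv), "dia", columns=["id", "flux"])
    .pipe(pack_csv, io.StringIO(forced_csv), "forced", columns=["id", "mjd"])
)
```

Because each step is a separate call, adding a new per-frame option only changes the signature of that one step.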

Code Quality

I have read the Contribution Guide
My code follows the code style of this project
My code builds (or compiles) cleanly without any errors or warnings
My code contains relevant comments and necessary documentation

Project-Specific Pull Request Checklists

Bug Fix Checklist

My fix includes a new test that breaks as a result of the bug (if possible)
My change includes a breaking change
- My change includes backwards compatibility and deprecation warnings (if possible)

New Feature Checklist

I have added or updated the docstrings associated with my feature using the NumPy docstring format
I have updated the tutorial to highlight my new feature (if appropriate)
I have added unit/End-to-End (E2E) test cases to cover my new feature
My change includes a breaking change
- My change includes backwards compatibility and deprecation warnings (if possible)

Documentation Change Checklist

Any updated docstrings use the NumPy docstring format

Build/CI Change Checklist

If required or optional dependencies have changed (including version numbers), I have updated the README to reflect this
If this is a new CI setup, I have added the associated badge to the README

Other Change Checklist

Any new or updated docstrings use the NumPy docstring format.
I have updated the tutorial to highlight my new feature (if appropriate)
I have added unit/End-to-End (E2E) test cases to cover any changes
My change includes a breaking change
- My change includes backwards compatibility and deprecation warnings (if possible)

github-actions · 2024-04-08T20:02:36Z

Before [`00be464`]	After [`a2794e0`]	Ratio	Benchmark (Parameter)
1.18±1s	3.16±1s	~2.67	benchmarks.time_computation
3.16k	1.71k	0.54	benchmarks.mem_list

codecov · 2024-04-08T20:03:05Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.93%. Comparing base (00be464) to head (041e513).

@@            Coverage Diff             @@
##             main      #21      +/-   ##
==========================================
+ Coverage   92.75%   92.93%   +0.17%     
==========================================
  Files          12       13       +1     
  Lines         552      566      +14     
==========================================
+ Hits          512      526      +14     
  Misses         40       40

hombit

Great, thank you!

src/nested_pandas/nestedframe/io.py

Initial read_parquet MVP implementation

b2832ea

dougbrn force-pushed the read_parquet branch from e1ef21f to b2832ea on April 9, 2024 17:01

add read_parquet test, tweak read_parquet

189c596

dougbrn changed the title ~~WIP: Initial read_parquet MVP implementation~~ Initial read_parquet MVP implementation Apr 9, 2024

dougbrn marked this pull request as ready for review April 9, 2024 18:57

dougbrn requested a review from hombit April 9, 2024 18:57

hombit approved these changes Apr 9, 2024

src/nested_pandas/nestedframe/io.py (outdated, resolved)

src/nested_pandas/nestedframe/io.py (resolved)

dougbrn added 2 commits April 9, 2024 12:30

engine and dtypes

7ec4bb5

add backend todo

041e513

dougbrn merged commit 136f2c5 into main Apr 9, 2024
11 checks passed

dougbrn deleted the read_parquet branch April 9, 2024 19:47
