Save to parquet #343
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@            Coverage Diff             @@
##             main     #343      +/-   ##
==========================================
+ Coverage   94.66%   94.99%   +0.32%
==========================================
  Files          24       24
  Lines        1557     1638      +81
==========================================
+ Hits         1474     1556      +82
+ Misses         83       82       -1

☔ View full report in Codecov by Sentry.
src/tape/ensemble.py
Outdated
        elif additional_frames is False:
            frames_to_save = ["object", "source"]  # save just object and source
        elif isinstance(additional_frames, Iterable):
            frames_to_save = [frame for frame in additional_frames if frame in list(self.frames.keys())]
Nit: we can use sets here and give the user a more helpful error message:

frames_to_save = set(additional_frames)
invalid_frames = frames_to_save.difference(set(self.frames.keys()))
# Raise an error telling the user which frames were invalid
if len(invalid_frames) > 0:
    raise ValueError(f"Invalid frames requested: {invalid_frames}")
frames_to_save.update(["object", "source"])
Implemented in the latest commit.
        know how to work with the directory whether or not the object
        subdirectory is present.

        Be careful about repeated saves to the same directory name. This will
Given how liberally Jupyter users may hit "run all cells", I think this could lead to some easy mistakes.
What about having an "overwrite" parameter?
If True, we first try to delete the save directory (possibly printing a message if we did so).
If False, we can throw an exception if the save directory already exists, noting that they should use a different name or set "overwrite=True".
My instinct is that the default value of the parameter should be False.
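For illustration, a minimal sketch of what such a parameter could look like, assuming save_ensemble joins path and dirname into ens_path as it does in this PR (the overwrite argument and messages here are hypothetical, not the merged implementation):

import os
import shutil

def save_ensemble(self, path=".", dirname="ensemble", overwrite=False, **kwargs):
    ens_path = os.path.join(path, dirname)
    if os.path.exists(ens_path):
        if overwrite:
            # Remove the previous save so stale subdirectories don't linger
            shutil.rmtree(ens_path)
            print(f"Removed existing ensemble directory at {ens_path}")
        else:
            raise FileExistsError(
                f"{ens_path} already exists; use a different dirname or set overwrite=True"
            )
    # ... proceed with the parquet writes ...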
Yeah, so we have access to overwrite already via the **kwargs passed along to all the to_parquet calls. The catch is that this will only overwrite subdirectories (ensemble/object, ensemble/source, ensemble/result, etc.) that were present in both saves, so any previous subdirectories will not be deleted if a new save doesn't include them. I'm also wary of doing any form of delete, since someone could specify a non-empty directory and I wouldn't want to delete anything unrelated.
So: 1. Maybe the overwrite parameter should graduate out of **kwargs for visibility? And 2. I'm not sure, but maybe save could write some kind of subdirs metadata file that lets TAPE know which subdirectories it made, so that overwrite can clean those directories up in subsequent overwrite-enabled saves?
Wilson and I talked offline a bit on this. The strategy that we agreed on is to generate a metadata file that knows which subdirectories were produced by the save command. Successive save commands will look for this file and use it to locate which subdirectories in the directory should be removed during the save operation. In particular, we'd be looking for any subdirectories created by a previous save command that will not be overwritten by the current save command. Parquet will handle overwrites for conflicting directories, but we will need ensemble.save_ensemble to catch and remove these frame subdirectories that would otherwise not be touched.
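A rough sketch of that cleanup step, assuming the metadata file is the metadata.json discussed in this thread and stores a "subdirs" list (the helper name is illustrative, not the PR's code):

import json
import os
import shutil

def _clean_previous_save(ens_path, new_subdirs):
    """Remove subdirectories created by a previous save that the current save won't rewrite."""
    meta_path = os.path.join(ens_path, "metadata.json")
    if not os.path.exists(meta_path):
        return  # no previous save recorded, nothing to clean up
    with open(meta_path, "r") as oldfile:
        old_metadata = json.load(oldfile)
    for subdir in old_metadata.get("subdirs", []):
        if subdir not in new_subdirs:
            shutil.rmtree(os.path.join(ens_path, subdir))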
The latest commit implements this, with the tweak that I ended up opting to just clean all the previous subdirectories up first as it was logically easier.
src/tape/ensemble.py
Outdated
        # Check whether object is present; it's not saved when no columns are present
        if "object" in os.listdir(dirpath):
            self.from_parquet(src_path, obj_path, column_mapper=column_mapper, **kwargs)
Should we always use sync_tables=False, since we synced object and source before saving?
Also, as a general question that may be out of scope for this PR: are we preserving divisions here? If so, I guess it would be helpful to use sorted=True if we knew that information.
I believe we can always have sync_tables=False, right, good point!
For divisions, these are not being preserved at the moment, which definitely seems like an issue. We can save an ensemble without divisions being known, so the sorted parameter will need to depend on that information.
Similar to above, maybe there's a need for a generated metadata file that indicates whether divisions are known, which could be read in at load time?
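A sketch of the read call being discussed here (the sorted handling changes in later comments once metadata takes over, so treat the flag below as illustrative):

# Inside from_ensemble: object and source were synced before saving,
# so table syncing can be skipped; sorted is only safe when divisions
# were known at save time (divisions_known is a hypothetical flag here).
self.from_parquet(
    src_path,
    obj_path,
    column_mapper=column_mapper,
    sync_tables=False,
    sorted=divisions_known,
)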
I just pushed another commit to better expose the sort/sorted kwargs from ensemble.from_parquet so they are usable in these cases. Some kind of metadata may be preferable here still so the user doesn't need to specify at all.
Edit: No longer accurate; refer to the next comment.
Another update, I now have this working with metadata. ensemble.save_ensemble now stores a boolean per frame in the metadata.json file indicating whether that frame had divisions set. Instead of using the sort/sorted kwargs, I've changed ensemble.from_ensemble to use this to determine whether to calculate divisions or not. In the case of object and source, this is just by setting the sorted flag or not. For the other frames, the sorted flag is only applicable to the set_index call that ensemble.from_parquet folds into. So this PR now has the frames generate a parquet metadata file which populates divisions information if it's available, and the reading of additional frames now uses the parquet metadata file.
I experimented with trying to get our ensemble.from_parquet to use these _metadata files, but it would be a non-trivial change in the logic that felt out of scope for this PR. It's potentially worth logging for the future as using the _metadata files would avoid having to calculate the min/max values per partition to get divisions, but on the other hand these files can get quite large for large datasets so they may have issues at scale.
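A simplified sketch of the per-frame divisions bookkeeping described above (function names and the exact JSON layout are illustrative, not the PR's code); known_divisions is the standard Dask DataFrame attribute:

import json
import os

def write_divisions_metadata(ens_path, frames):
    # Record, per frame label, whether Dask divisions were known at save time
    metadata = {"known_divisions": {label: frame.known_divisions for label, frame in frames.items()}}
    with open(os.path.join(ens_path, "metadata.json"), "w") as outfile:
        json.dump(metadata, outfile, indent=4)

def divisions_were_known(ens_path, label):
    # At load time, decide whether the saved frame can be treated as sorted
    with open(os.path.join(ens_path, "metadata.json"), "r") as infile:
        metadata = json.load(infile)
    return metadata["known_divisions"].get(label, False)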
Have some questions and suggestions but overall looks good to me! Thanks Doug!
src/tape/ensemble.py
Outdated
        # Save a ColumnMapper file
        col_map = self.make_column_map()
        np.save(os.path.join(ens_path, "column_mapper.npy"), col_map.map)
        # np.save(os.path.join(ens_path, "column_mapper.npy"), col_map.map)
Looks like we left this line in :)
src/tape/ensemble.py
Outdated
@@ -1284,19 +1286,32 @@ def save_ensemble(self, path=".", dirname="ensemble", additional_frames=True, **
        # Determine the path
        ens_path = os.path.join(path, dirname)

        # First look for an existing metadata.json file in the path
        try:
            with open(os.path.join(ens_path, "metadata.json"), "r") as oldfile:
Nit: we can probably move the filename "metadata.json" into a METADATA_FILENAME constant or equivalent.
Also, a more descriptive name like "ensemble_metadata.json" might be less likely to clash with random files in some user's directory.
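Something along these lines, with the constant and helper names only illustrative:

import os

METADATA_FILENAME = "ensemble_metadata.json"  # defined once, less likely to clash with user files

def metadata_path(ens_path):
    return os.path.join(ens_path, METADATA_FILENAME)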
src/tape/ensemble.py
Outdated
        # Check whether object is present; it's not saved when no columns are present
        if "object" in os.listdir(dirpath):
            self.from_parquet(src_path, obj_path, column_mapper=column_mapper, **kwargs)
        if "object" in subdirs:
Nit: forgot to reference this earlier, but we can use the SOURCE_FRAME_LABEL and OBJECT_FRAME_LABEL instead of "object" and "source" everywhere
        old_subdirs = old_metadata["subdirs"]
        # Delete any old subdirectories
        for subdir in old_subdirs:
            shutil.rmtree(os.path.join(ens_path, subdir))
If we're not able to delete a subdirectory, do we want to explicitly handle the exception and provide additional logging?
I'm thinking that whatever error shutil would provide would probably be as descriptive as anything we'd provide? Also, I'm not sure how easy it would be to test this; setting permissions mid-test seems messy...
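For reference, the explicit handling suggested above might look roughly like this, reusing ens_path and old_subdirs from the snippet quoted earlier (a sketch only; as noted, the thread leans toward letting shutil's own error surface):

import os
import shutil
import warnings

for subdir in old_subdirs:
    target = os.path.join(ens_path, subdir)
    try:
        shutil.rmtree(target)
    except OSError as err:
        # Surface which previous-save subdirectory could not be removed
        warnings.warn(f"Unable to remove previous save subdirectory {target}: {err}")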
src/tape/ensemble.py
Outdated
        }
        json_metadata = json.dumps(metadata, indent=4)

        with open(os.path.join(ens_path, "metadata.json"), "w") as outfile:
I wonder if we should write the new metadata file before doing any saving to parquet.
My thinking is that in the case where:
- the user wants to save a large ensemble
- some dataframes have been saved
- a crash occurs while saving a subsequent result dataframe
- the user runs save_ensemble later in the same folder but with differently named result dataframes
old subdirectories won't be cleaned up, since no metadata file was written. Admittedly a bit of an edge case, and also a minor pain since we would have to loop over the frames/subdirs twice.
Moved the parquet saving to the very end of the function in the latest commit.
    if npartitions and npartitions > 1:
        source_frame = source_frame.repartition(npartitions=npartitions)
    elif partition_size:
        source_frame = source_frame.repartition(partition_size=partition_size)
I think we can just add a partition_size pytest parameter and we can get rid of the code coverage warning here
Adding two conftest.py functions (one for from_parquet and one for from_dask_dataframes) as I think all the parameters for these loader functions live there
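A standalone illustration of exercising the partition_size branch; this is not the actual conftest.py parametrization, just the Dask behavior the new parameter drives:

import dask.dataframe as dd
import pandas as pd
import pytest

@pytest.mark.parametrize("partition_size", ["1MB", "100MB"])
def test_repartition_by_partition_size(partition_size):
    # Mirrors the loader's partition_size branch in isolation
    df = dd.from_pandas(pd.DataFrame({"flux": range(1000)}), npartitions=4)
    repartitioned = df.repartition(partition_size=partition_size)
    assert repartitioned.npartitions >= 1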
    dirpath,
    additional_frames=True,
    column_mapper=None,
    dask_client=True,
Nit: move dask_client to reflect the parameter order in the docstring (or vice versa).
src/tape/ensemble_readers.py
Outdated
        sorted: bool, optional
            If the index column is already sorted in increasing order.
            Defaults to False
        sort: `bool`, optional
            If True, sorts the DataFrame by the id column. Otherwise set the
            index on the individual existing partitions. Defaults to False.
Do we want sort and sorted to be explicit parameters in the function definition? Or did we mean to remove these from the docstring?
Good catch, forgot to remove these from ensemble_readers.py as well!
LGTM! :)
Change Description
Resolves #151.
Solution Description