-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New BatchSchema class to generate patch selectors and combine into batch selectors #132
Conversation
Codecov Report
@@ Coverage Diff @@
## main #132 +/- ##
==========================================
- Coverage 98.37% 97.23% -1.15%
==========================================
Files 6 6
Lines 246 325 +79
Branches 49 68 +19
==========================================
+ Hits 242 316 +74
- Misses 4 6 +2
- Partials 0 3 +3
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
@jhamman this xbatcher refactor that we discussed over many meetings is now complete. It would be fantastic to hear any thoughts on the PR, but I also recognize that you are busy and on your own time now. Thanks for your brainstorming on these issues! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really nice work here @maxrjones! Super detailed examples. I don't have any meaningful comments though.
pytest.param( | ||
False, | ||
marks=pytest.mark.xfail( | ||
reason="Bug described in https://github.com/xarray-contrib/xbatcher/issues/126" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't know you can add xfail within a parameterized test, very cool!
@maxrjones, this looks great overall! i really like the graphics in #132 (comment). Having some of these graphics included in the docs would be awesome! |
Co-authored-by: Anderson Banihirwe <[email protected]>
Description of proposed changes
This PR is a major overhaul of xbatcher’s internals. Users should not notice any external changes, apart from bug fixes and some performance differences.
New BatchSchema class
In v0.2.0, information about the batch_selectors was stored as a dict in the
._batch_selectors
attribute of theBatchGenerator
class. This PR introducesBatchSchema
which contains all necessary information about the batch selectors.BatchGenerator
now creates an instance ofBatchSchema
, which handles the creation ofselectors
object. The purpose ofBatchGenerator
is to create batches (xarray Datasets/DataArrays) from those selector objects. This opens up possibilities including serializing/deserializing the BatchSchema (e.g.,BatchSchema.to_json()
andBatchSchema.from_json()
), cachingBatchSchema
objects separate from caching batches, and relatedly applying oneBatchSchema
instance to multiple xarray datasets.The representation of the batch selectors (
BatchSchema.selectors
) is the same regardless of whetherconcat_input_dims
isTrue
orFalse
(in contrast to v0.2.0 where the type varied based on that parameter). The selectors object is still a dict with keys representing the batch index and values representing the batch selector. The batch selector is a list of dicts, where the keys are the dimensions included inbatch_dims
and/orinput_dims
and the values are slices used to index the xarray dataset/dataarray. It’s simplest to refer to the subset of data created by each of these dictionaries as a patch. Ifconcat_input_dims
is False, each list has length 1 (i.e., 1 patch per batch). Ifconcat_input_dims
is True, the lists can contain multiple patches, which will be concatenated to form a batch by the generator.Bug fixes
Handling batch_dims (#131, #121)
As explained #121 (comment), this issue stemmed from information about the
batch_dims
indices not being stored in the selectors object. After this PR, each patch contains information about all dims that are indexed on (i.e., any dim included ininput_dims
orbatch_dims
). Those patches are then combined into batches based on the number of items in the list for each batch selector.Handling of overlapping samples
In v.2.0, the
batch_dims
were sliced first and subsequentlyindex_dims
were sliced. This had the consequence of missing away any patch that overlapped two batches. Now, the smallest patches created by the combination ofbatch_dims
andinput_dims
are defined first and these are then combined into batches based on the starting index (see simple example below).Performance changes
Running the benchmarks from #140 on main,
4d8e2c84a2d405e237f60f1df5286dd766e06ff0
, and this PR revealed a mix of performance increases and decreases. Based on a quick investigation, I think that the regressions mostly relate to loading each batch individually, whereas before this PR if a dimension was included inbatch_dims
andinput_dims
, the data associated with the slice frombatch_dims
would be loaded even ifconcat_input_dims
was False and that slice shouldn’t matter. Since loading more data improves performance, those benchmarks are now slower. Also, the handling of overlapping samples means there are more patches which will impact performance. In contrast, the benchmarks affected by #121 are now much faster.To-Do:
Note - One test fails (marked with xfail) due to #126 which will be addressed separately
Fixes #131
Fixes #121
Fixes #30