[python] Blockwise SciPy sparse and Arrow Table iterator #1792
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅ (40 files with indirect coverage changes).
Overall shape of the API looks good. I have some implementation-level suggestions but will save that for Actual Review Time.

@thetorpedodog - feel free to make them now if low effort for you. I'm happy to keep refining the core implementation while we discuss the API.
High level API/docs comment: it may be too easily missed that this is the way to page through complete rows/columns of the matrix (unlike the Arrow table iterator right beside it, which can yield ragged chunks). I think a new user would need to read the guts of the docstring to deduce that. At the least, the first paragraph of the docstring could make this explicit. I even wonder about splitting the […]
IMHO, there is no reason we can't add the same (non-ragged) approach for Table iterators. I just felt that it was a separate feature request, so I deferred it for now. My implementation would facilitate it easily, as there is a stand-alone iterator already implemented which provides exactly this functionality for 2D arrays; it just needs to be extended to nD, which I believe is straightforward. @aaronwolen @pablo-gar - do you guys want to see this enhancement in this PR? I can add it, or we can push it to the future. It would likely look something like:

```python
# generate ragged (current implementation)
A.read().tables()

# generate row-major non-ragged
A.read().tables(axis=0, step=10000)

# generate col-major non-ragged
A.read().tables(axis=1, step=10000)
```

The same constraints on the compatibility of `result_order` and the step axis would apply.

As a late thought on the ragged-result topic: the SOMA `read()` methods already support ordering via the `result_order` parameter. The primary reason this PR does not take that approach is performance - the current TileDB ordered-reader implementation exacts a (roughly) 50% read-time penalty over unordered. But with a bit of work, the user already has the ability to get non-ragged results.
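To make the ragged vs. non-ragged distinction concrete, here is a plain-SciPy sketch of row-complete blocking (illustrative only; `blockwise_rows` is a hypothetical helper, not the tiledbsoma API):

```python
import numpy as np
import scipy.sparse as sp

# A toy sparse matrix standing in for a SOMA sparse array.
A = sp.random(100, 50, density=0.1, format="csr", random_state=0)

def blockwise_rows(m, step):
    """Yield consecutive blocks, each containing complete rows of `m`."""
    for start in range(0, m.shape[0], step):
        yield m[start : start + step]

blocks = list(blockwise_rows(A, 30))
# Every chunk holds whole rows; only the last one may be short.
print([b.shape for b in blocks])  # [(30, 50), (30, 50), (30, 50), (10, 50)]
```

A ragged iterator, by contrast, is free to end a chunk mid-row, so no single chunk can be fed to a per-row computation on its own.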
Regarding:
I prefer the single signature as the defaults are almost always sufficient (in my experience and past usage). That said, I am very happy to change this if we have an alternative consensus. The other option is to simply add aliases (sugar) which call the base (most flexible) variant.
Potential issues

I've had some time to digest it. Current potential issues I see: […]
Suggestion

I have a suggestion that I think addresses the two comments above and which I think is future-proof:

1. Add a `step` parameter to `read`:

```python
def read(
    self,
    coords: options.SparseNDCoords = (),
    step: Optional[int] = None,
    *,
    result_order: options.ResultOrderStr = options.ResultOrder.AUTO,
    batch_size: options.BatchSize = _UNBATCHED,
    partitions: Optional[options.ReadPartitions] = None,
    platform_config: Optional[PlatformConfig] = None,
) -> "SparseNDArrayRead":
    """
    Reads a user-defined slice of the :class:`SparseNDArray`.

    Args:
        [...]
        step:
            When specified and `result_order` is "ROW_MAJOR" or "COLUMN_MAJOR",
            the resulting read iterators will guarantee full row or column
            pagination, respectively. When `None`, the resulting iterators
            produce ragged results per chunk.
        [...]
    """
```

2. Remove […]

3. Implement row/column pagination for other iterators: implement later and raise an error for now if […]

Predicted API functionality:

```python
# generate ragged
A.read().coos()
A.read().dense_tensors()
A.read().tables()
A.read().scipy()
A.read(result_order="ROW_MAJOR").tables()
...

# generate row-major non-ragged
A.read(result_order="ROW_MAJOR", step=1000).scipy()                  # scipy.csr
A.read(result_order="ROW_MAJOR", step=1000).scipy(compress=False)    # scipy.coo
A.read(result_order="ROW_MAJOR", step=1000).coos()                   # raise error if not yet implemented
A.read(result_order="ROW_MAJOR", step=1000).tables()                 # raise error if not yet implemented
...

# generate column-major non-ragged
A.read(result_order="COLUMN_MAJOR", step=1000).scipy()               # scipy.csc
A.read(result_order="COLUMN_MAJOR", step=1000).scipy(compress=False) # scipy.coo
A.read(result_order="COLUMN_MAJOR", step=1000).coos()                # raise error if not yet implemented
A.read(result_order="COLUMN_MAJOR", step=1000).tables()              # raise error if not yet implemented
...

# incompatible, raise error
A.read(result_order="AUTO", step=1000)
```
I prefer the […]. While the […]
There is a lot of confusion here that is (likely) the result of thinking about 2D matrices. Sort order and a step axis are entirely different concepts: […]

For 2D scipy, these concepts are quite close. For nD > 2D, and for formats that do not imply any particular point order (e.g., Table), `result_order` and the step axis control entirely different things.
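A tiny NumPy illustration of the distinction (plain NumPy, not the SOMA API): result order sorts the emitted coordinates, while the step axis decides which dimension each chunk covers completely, and the two vary independently.

```python
import numpy as np

# Four nonzero (dim0, dim1) coordinates of a toy 2x3 sparse array.
coords = np.array([(1, 0), (0, 2), (1, 2), (0, 1)])

# result_order: a sort over the emitted points.
row_major = coords[np.lexsort((coords[:, 1], coords[:, 0]))]  # dim0, then dim1
col_major = coords[np.lexsort((coords[:, 0], coords[:, 1]))]  # dim1, then dim0

# step axis: chunking along one dimension; here axis 0 with step=1, so each
# chunk holds every point of one dim0 slice, regardless of emission order.
chunks_axis0 = [coords[coords[:, 0] == i] for i in range(2)]
```

The same `chunks_axis0` partition is compatible with either sort order inside each chunk, which is why the two knobs are orthogonal for nD and for Tables.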
This is not true in numpy, which is arguably the default (standard) for nD stuff in Python. Also note that it isn't true in Arrow. And Pandas is mostly `to_numpy`, not `to_numpy_ndarray`/`to_numpy_matrix`.
@atolopko-czi - definitely a bug. A corner case I had missed. Fix and unit tests will be pushed shortly. Update: now fixed by 93d3233
LGTM! Code structure is solid and code is well tested. API is straightforward to use. A few final nits for your consideration. Approving optimistically, ahead of fix for #1792 (comment). Looking forward to using this!
```python
axis: Union[int, Sequence[int]],
*,
size: Optional[Union[int, Sequence[int]]] = None,
reindex_disable: Optional[Union[int, Sequence[int]]] = None,
```
Minor suggestion: what if this also accepted a bool, where `True` maps to `range(len(axis))` and `False` maps to `None`?
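A minimal sketch of the suggested sugar, normalizing a bool onto the existing `int`/`Sequence[int]` typing (the helper name and placement are hypothetical, not the actual implementation):

```python
from typing import Optional, Sequence, Union

def normalize_reindex_disable(
    value: Optional[Union[bool, int, Sequence[int]]],
    axis: Sequence[int],
) -> Optional[Sequence[int]]:
    """Map the proposed bool shorthand onto the existing parameter type."""
    if value is True:
        # disable re-indexing on every queried axis
        return list(range(len(axis)))
    if value is False or value is None:
        # keep the default: re-index everywhere
        return None
    if isinstance(value, int):
        return [value]
    return list(value)
```

Note the `value is True` check must come before the `int` check, since `bool` is a subclass of `int` in Python.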
I'm not a hard no, but I worry that doing so would overload the meaning of this parameter a bit.
nice idea, will add. more sugar, more unit tests :-)
I'm ambivalent - but have it implemented. Who wants to decide?
I suggested it because while test-driving the API, I mistakenly assumed it was a bool flag. :) After RTFM, it made sense, but it seemed like a minor trap. If we want to keep the typing as is, maybe consider renaming the param, e.g. `reindex_disable_on_axis`.
@atolopko-czi's `reindex_disable_on_axis` suggestion seems like a good middle ground. The arg name is a bit verbose but its intent is clear.
I see disabling re-indexing as a bit more of an advanced functionality, and it will also frequently happen on just one axis. Let's not add more cognitive load with a `bool` option. Renaming the parameter is a +1. Thus:

- `reindex_disable_on_axis` should be the name
- `reindex_disable_on_axis` should not take `bool`

This is one of those cases where we can easily add the sugar in the future if needed, whereas removing it in the future can get complicated.
@atolopko-czi - fix for the bug you found is in the branch. See 93d3233
Overall I like this a lot! A few things which are mostly smaller concerns but it’s looking good.
```python
_EagerRT = TypeVar("_EagerRT")


def _if_eager(
    self, x: Iterator[_EagerRT], _pool: Optional[ThreadPoolExecutor] = None
```
I think the `pool` here is always going to be `self.pool`, right?
No. I want the ability to override `self.pool` with the context-managed pool, so it is definitely not the same; `self.pool` is whatever the user provided.

The goal of this is to allow either:

- a user-provided thread pool, OR
- an internally created one, which is cleaned up by the context exit

The latter is reliant on the generator/context integration.
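That pattern can be sketched as follows (illustrative names, not the actual implementation): use the caller's executor when supplied, otherwise create one whose shutdown is tied to the enclosing context's exit.

```python
from concurrent.futures import ThreadPoolExecutor
from contextlib import ExitStack
from typing import Callable, List, Optional

def run_eager(
    work: List[Callable[[], int]],
    pool: Optional[ThreadPoolExecutor] = None,
) -> List[int]:
    """Run `work` on `pool` if given, else on a temporary pool (hypothetical helper)."""
    with ExitStack() as stack:
        if pool is None:
            # Internally created pool; shut down automatically on context exit.
            pool = stack.enter_context(ThreadPoolExecutor(max_workers=2))
        futures = [pool.submit(w) for w in work]
        return [f.result() for f in futures]
```

A caller-supplied pool is never shut down by the helper, which is the point of the override: the user retains ownership of their executor's lifecycle.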
This is fantastic. Docs are comprehensive and super clear. Thanks!
A few minor things, but overall very good.
```python
self.joinids: List[pa.Array] = [
    pa.array(
        np.concatenate(
            list(
```
Is this `list(` needed? I would expect numpy to take any iterable you give it (though I could be wrong).
Your expectation is wrong - it takes a sequence, not an iterable.
Demo:

```python
In [4]: np.concatenate( (np.zeros((i,)) for i in range(4)) )
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In [4], line 1
----> 1 np.concatenate( (np.zeros((i,)) for i in range(4)) )

File <__array_function__ internals>:200, in concatenate(*args, **kwargs)

TypeError: The first input argument needs to be a sequence

In [5]: np.concatenate( tuple(np.zeros((i,)) for i in range(4)) )
Out[5]: array([0., 0., 0., 0., 0., 0.])
```
* add flake8 config for IDE happiness
* first cut at scipy iter for sparse array
* cleanup
* additional refinement of scipy iterator, plus unit tests
* clean up typing and docstring
* 3.7 support
* refine docstring
* cleanup
* performance work
* fix typo in docstring
* remove debugging print
* PR feedback on docstring
* add prototype stepped table iterator
* add result_order to stepped table iterator
* major revision to proposed blockwise api
* slim down unit test
* initial set of PR review changes
* clean up passing pool to generators
* more refactoring from PR review f/b
* even DRYer
* PR review fixes
* docstring revision based on PR feedback
* fix bug found in PR review
* additional PR review changes
* more PR inspired changes
* comment in response to PR f/b
* more PR f/b
* rename reindex_disable
🚀
Co-authored-by: Bruce Martin <[email protected]>
Connect the re-indexer to the blockwise iterator, allowing reads to be re-indexed on-the-fly. This PR parallels #1792 and completes #2152 and #2637; in addition, it provides new shorthand for `reindex_disable_on_axis`:

- `TRUE`: disable re-indexing on all axes
- `FALSE`: re-index on all axes
- `NA`: re-index only on the major axis, disable re-indexing on all minor axes (default)

`BlockwiseTableReadIter$concat()` and `BlockwiseSparseReadIter$concat()` are disabled when re-indexing is requested (paralleling Python).

`BlockwiseSparseReadIter` now accepts `repr = "R"` or `repr = "C"` under certain circumstances:

- axis 0 (`soma_dim_0`) must be re-indexed to allow `repr = "R"`
- axis 1 (`soma_dim_1`) must be re-indexed to allow `repr = "C"`

A `repr` of `"T"` is allowed in all circumstances and continues to be the default.

Two new fields are available to blockwise iterators:

- `$axes_to_reindex`: a vector of minor axes slated to be re-indexed
- `$reindexable`: a status indicator stating whether _any_ axis (major or minor) is slated to be re-indexed

Resolves #2671
Co-authored-by: Paul Hoffman <[email protected]>
Issue and/or context:
Fixes #1503
Changes:
Added blockwise SciPy and Arrow Table iterators for the `SparseNDArray` class. Also usable from the `ExperimentAxisQuery.X()` method. Example usage:
produces:
The API is implemented on SparseNDArray (not ExperimentAxisQuery), and can be used directly on the sparse array, e.g.,
In addition to the SciPy blockwise iterator, you can also blockwise iterate over Arrow Tables, with all the same controls over axis, reindexing, step size, etc.
The docstrings have more detail, but there are several parameters that control what is emitted by the iterator:
For example, to produce a CSC while stepping column (var)-wise:
Or a COO while stepping row (obs)-wise:
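As a rough stand-in for those examples, here is a plain-SciPy sketch (illustrative only, not the tiledbsoma API) of what the two modes yield: column-complete CSC blocks when stepping column-wise, and row-complete COO blocks when stepping row-wise.

```python
import scipy.sparse as sp

# Toy stand-in for an X matrix.
A = sp.random(6, 8, density=0.3, format="csc", random_state=1)

# CSC blocks while stepping column (var)-wise: each block holds complete columns.
csc_blocks = [A[:, j : j + 4].tocsc() for j in range(0, A.shape[1], 4)]

# COO blocks while stepping row (obs)-wise: each block holds complete rows.
coo_blocks = [A.tocsr()[i : i + 2].tocoo() for i in range(0, A.shape[0], 2)]
```

In the real API the block format follows the stepping axis for the same reason: compressed-sparse output is only well-defined when the major axis of the block is complete.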
Notes for Reviewer: