-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[python] NDArray
read path
#1817
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅ ❗ Your organization needs to install the Codecov GitHub app to enable full functionality. see 110 files with indirect coverage changes 📢 Thoughts on this report? Let us know!. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the caveat that I have not yet exensively looked at the C++ side of things, here is a pass on the Python side. Sorry for missing this yesterday!
wrapper: type[Wrapper[Any | Any | Any]] | ||
if self.mode == "r" and clib.SOMADataFrame.exists(entry.entry.uri): | ||
if self.mode == "r" and clib.SOMADataFrame.exists(uri): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I probably should have noticed this earlier, but: we probably need to be passing the context we have to this function, right? It might contain user credentials or other important configuration (e.g. endpoint locations)
@@ -105,7 +106,8 @@ def shape(self) -> Tuple[int, ...]: | |||
Lifecycle: | |||
Experimental. | |||
""" | |||
return cast(Tuple[int, ...], tuple(self._soma_reader().shape)) | |||
handle: SOMAArray = self._handle |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is only tangentially related to this line, so:
Given that _handle
is now something different for all of the TileDB Array–based types, it’s probably worth it to pull that generic specialization out of TileDBArray
and put it into each of the concrete ones (so this becomes a TileDBArray[SOMAArray]
, or however the wrapper type generic stuff works because I kind of forget, etc.).
This isn’t critical, so if it ends up being a mess I wouldn’t worry about it but if it is reasonably straightforward it’s worth considering.
|
||
def column_to_enumeration(self, name: str) -> str: | ||
return str(self._soma_reader().get_enum_label_on_attr(name)) | ||
def enumeration(self, name: str) -> Optional[Tuple[Any, ...]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid use of Any
, it may be better to call this Tuple[object, ...]
. Doing so would have the advantage that the caller would need to either check or assert that it gets the returned type it wants, which is safer. But also it has the disadvantage that the caller would need to either check or assert that it gets the returned type it wants, which is kind of annoying.
): | ||
raise SOMAError( | ||
f"cannot open {hdl.uri!r}: a {type(hdl._handle)}" | ||
f" cannot be converted to a {typename}" | ||
) | ||
print(typename, cls, type(hdl)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stray debug print
handle: clib.DataFrameWrapper = self._handle | ||
return cast(int, handle.count) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could some of these be avoided by making a libtiledbsoma.pyi
file? That would also give us some tooling help in editors. Up to you, and even if you do want to, it’s not something that needs to happen in this change.
to_clib_result_order = { | ||
options.ResultOrder.AUTO: clib.ResultOrder.automatic, | ||
options.ResultOrder.ROW_MAJOR: clib.ResultOrder.rowmajor, | ||
options.ResultOrder.COLUMN_MAJOR: clib.ResultOrder.colmajor, | ||
"auto": clib.ResultOrder.automatic, | ||
"row-major": clib.ResultOrder.rowmajor, | ||
"column-major": clib.ResultOrder.colmajor, | ||
} | ||
if result_order not in to_clib_result_order: | ||
raise ValueError(f"Invalid result_order: {result_order}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to go with the above it looks like this could be pulled into a function.
{k: str(v) for k, v in context.tiledb_config.items()}, | ||
[], | ||
clib.ResultOrder.automatic, | ||
(0, timestamp), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add names to these arguments? The floating []
makes me nervous because I have no idea what it belongs to.
(0, timestamp), | ||
) | ||
class SOMAArrayWrapper(Wrapper[SOMAArray]): | ||
"""Wrapper for Array-derived SOMAObject classes.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you add ARRAY_IMPL: Type[SOMAArray]
here…
@classmethod | ||
def _opener( | ||
cls, | ||
uri: str, | ||
mode: options.OpenMode, | ||
context: SOMATileDBContext, | ||
timestamp: int, | ||
) -> clib.SOMADataFrame: | ||
open_mode = clib.OpenMode.read if mode == "r" else clib.OpenMode.write | ||
return clib.SOMADataFrame.open( | ||
uri, | ||
open_mode, | ||
{k: str(v) for k, v in context.tiledb_config.items()}, | ||
[], | ||
clib.ResultOrder.automatic, | ||
(0, timestamp), | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
…and move this to SOMAArrayWrapper
, replacing clib.SOMADataFrame
with cls.ARRAY_IMPL
, and set ARRAY_IMPL = SOMADataFrame
here…
@classmethod | ||
def _opener( | ||
cls, | ||
uri: str, | ||
mode: options.OpenMode, | ||
context: SOMATileDBContext, | ||
timestamp: int, | ||
) -> clib.SOMASparseNDArray: | ||
open_mode = clib.OpenMode.read if mode == "r" else clib.OpenMode.write | ||
return clib.SOMASparseNDArray.open( | ||
uri, | ||
open_mode, | ||
{k: str(v) for k, v in context.tiledb_config.items()}, | ||
[], | ||
clib.ResultOrder.automatic, | ||
(0, timestamp), | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
…then we can eliminate all this duplicated code (by similarly setting an ARRAY_IMPL
here and below).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nguyenv the PR also does a major refactoring on the pybind11 files which is not mentioned in the description! Would that stay as a part of this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is in the description for the PR above which is higher prescendent than this PR. I should mark this PR as draft for now as I'm dealing with segfault issues arising from the newly blockchain iterator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will definitely help you with reorganizing the reindexer code if that is your main concern with the Pybind11 refactoring.
d5f4dc5
to
8f7951e
Compare
a6dc282
to
98d069d
Compare
* When opening a `DataFrame` in read-mode, use `DataFrameWrapper` which wraps around `clib.SOMADataFrame`. Otherwise, `DataFrame` should use the already existing write-path with `ArrayWrapper` which wraps around a TileDB-Py Array * Necessary changes to `_dataframe.py` to support the read-path already exist on another branch. That branch will be merged into this PR shortly
* Take care of formatting / typing * Correct datetime domains * Get full nonempty domains for `SOMADataFrame` * Find missing open that needs to use `DataframeWrapper`
* Move `PyQueryCondition` Into `common.h` * Use Pyarrow Schema instead of TileDB ArraySchema * Remove TileDB-Py dependency * No longer requires attr-to-enum mapping passed for dictionaries as this can be checked in Pyarrow Schema now
* Eventually the `arrow_schema` calls should replace `schema` but quite a few things still depend on the TileDB ArraySchema so this is going to be temporarily punted for now
98d069d
to
9268d35
Compare
8f7951e
to
0934313
Compare
21a16a0
to
6a12ef2
Compare
03e7a61
to
53cca7e
Compare
Issue and/or context:
To be merged on top of #1793.
Changes:
SparseNDArray
in read-mode, useSparseNDArrayWrapper
which wraps aroundclib.SOMASparseNDArray
DenseNDArray
in read-mode, useDenseNDArrayWrapper
which wraps aroundclib.SOMADenseNDArray
SOMAArray
SOMAArray
with the newSOMADataFrame
,SOMASparseNDArray
, andSOMADenseNDArray
classes, we can reorganize the C++ install headers so that they no longer install internal headers