Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[python] NDArray read path #1817

Closed
wants to merge 62 commits into from
Closed

Conversation

nguyenv
Copy link
Member

@nguyenv nguyenv commented Oct 23, 2023

Issue and/or context:

To be merged on top of #1793.

Changes:

  • When opening a SparseNDArray in read-mode, use SparseNDArrayWrapper which wraps around clib.SOMASparseNDArray
  • When opening a DenseNDArray in read-mode, use DenseNDArrayWrapper which wraps around clib.SOMADenseNDArray
  • Completely remove SOMAArray
  • When the R API replaces SOMAArray with the new SOMADataFrame, SOMASparseNDArray, and SOMADenseNDArray classes, we can reorganize the C++ install headers so that they no longer install internal headers

@codecov-commenter
Copy link

codecov-commenter commented Oct 23, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

see 110 files with indirect coverage changes

📢 Thoughts on this report? Let us know!.

@nguyenv nguyenv marked this pull request as ready for review October 23, 2023 18:29
@nguyenv nguyenv marked this pull request as draft October 23, 2023 18:42
@nguyenv nguyenv marked this pull request as ready for review October 23, 2023 21:24
Copy link
Contributor

@thetorpedodog thetorpedodog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the caveat that I have not yet exensively looked at the C++ side of things, here is a pass on the Python side. Sorry for missing this yesterday!

wrapper: type[Wrapper[Any | Any | Any]]
if self.mode == "r" and clib.SOMADataFrame.exists(entry.entry.uri):
if self.mode == "r" and clib.SOMADataFrame.exists(uri):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I probably should have noticed this earlier, but: we probably need to be passing the context we have to this function, right? It might contain user credentials or other important configuration (e.g. endpoint locations)

@@ -105,7 +106,8 @@ def shape(self) -> Tuple[int, ...]:
Lifecycle:
Experimental.
"""
return cast(Tuple[int, ...], tuple(self._soma_reader().shape))
handle: SOMAArray = self._handle
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is only tangentially related to this line, so:

Given that _handle is now something different for all of the TileDB Array–based types, it’s probably worth it to pull that generic specialization out of TileDBArray and put it into each of the concrete ones (so this becomes a TileDBArray[SOMAArray], or however the wrapper type generic stuff works because I kind of forget, etc.).

This isn’t critical, so if it ends up being a mess I wouldn’t worry about it but if it is reasonably straightforward it’s worth considering.


def column_to_enumeration(self, name: str) -> str:
return str(self._soma_reader().get_enum_label_on_attr(name))
def enumeration(self, name: str) -> Optional[Tuple[Any, ...]]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid use of Any, it may be better to call this Tuple[object, ...]. Doing so would have the advantage that the caller would need to either check or assert that it gets the returned type it wants, which is safer. But also it has the disadvantage that the caller would need to either check or assert that it gets the returned type it wants, which is kind of annoying.

):
raise SOMAError(
f"cannot open {hdl.uri!r}: a {type(hdl._handle)}"
f" cannot be converted to a {typename}"
)
print(typename, cls, type(hdl))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stray debug print

Comment on lines 273 to 274
handle: clib.DataFrameWrapper = self._handle
return cast(int, handle.count)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could some of these be avoided by making a libtiledbsoma.pyi file? That would also give us some tooling help in editors. Up to you, and even if you do want to, it’s not something that needs to happen in this change.

Comment on lines 148 to 157
to_clib_result_order = {
options.ResultOrder.AUTO: clib.ResultOrder.automatic,
options.ResultOrder.ROW_MAJOR: clib.ResultOrder.rowmajor,
options.ResultOrder.COLUMN_MAJOR: clib.ResultOrder.colmajor,
"auto": clib.ResultOrder.automatic,
"row-major": clib.ResultOrder.rowmajor,
"column-major": clib.ResultOrder.colmajor,
}
if result_order not in to_clib_result_order:
raise ValueError(f"Invalid result_order: {result_order}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to go with the above it looks like this could be pulled into a function.

Comment on lines 384 to 436
{k: str(v) for k, v in context.tiledb_config.items()},
[],
clib.ResultOrder.automatic,
(0, timestamp),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add names to these arguments? The floating [] makes me nervous because I have no idea what it belongs to.

(0, timestamp),
)
class SOMAArrayWrapper(Wrapper[SOMAArray]):
"""Wrapper for Array-derived SOMAObject classes."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you add ARRAY_IMPL: Type[SOMAArray] here…

Comment on lines 372 to 437
@classmethod
def _opener(
cls,
uri: str,
mode: options.OpenMode,
context: SOMATileDBContext,
timestamp: int,
) -> clib.SOMADataFrame:
open_mode = clib.OpenMode.read if mode == "r" else clib.OpenMode.write
return clib.SOMADataFrame.open(
uri,
open_mode,
{k: str(v) for k, v in context.tiledb_config.items()},
[],
clib.ResultOrder.automatic,
(0, timestamp),
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

…and move this to SOMAArrayWrapper, replacing clib.SOMADataFrame with cls.ARRAY_IMPL, and set ARRAY_IMPL = SOMADataFrame here…

Comment on lines 398 to 466
@classmethod
def _opener(
cls,
uri: str,
mode: options.OpenMode,
context: SOMATileDBContext,
timestamp: int,
) -> clib.SOMASparseNDArray:
open_mode = clib.OpenMode.read if mode == "r" else clib.OpenMode.write
return clib.SOMASparseNDArray.open(
uri,
open_mode,
{k: str(v) for k, v in context.tiledb_config.items()},
[],
clib.ResultOrder.automatic,
(0, timestamp),
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

…then we can eliminate all this duplicated code (by similarly setting an ARRAY_IMPL here and below).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nguyenv the PR also does a major refactoring on the pybind11 files which is not mentioned in the description! Would that stay as a part of this PR?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#1793

It is in the description for the PR above which is higher prescendent than this PR. I should mark this PR as draft for now as I'm dealing with segfault issues arising from the newly blockchain iterator.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will definitely help you with reorganizing the reindexer code if that is your main concern with the Pybind11 refactoring.

@nguyenv nguyenv force-pushed the viviannguyen/python-read-path branch 4 times, most recently from d5f4dc5 to 8f7951e Compare November 9, 2023 22:29
@nguyenv nguyenv marked this pull request as draft November 14, 2023 22:31
@nguyenv nguyenv force-pushed the viviannguyen/ndarray-read-path branch from a6dc282 to 98d069d Compare December 5, 2023 17:47
* When opening a `DataFrame` in read-mode, use `DataFrameWrapper` which
  wraps around `clib.SOMADataFrame`. Otherwise, `DataFrame` should
  use the already existing write-path with `ArrayWrapper` which wraps
  around a TileDB-Py Array
* Necessary changes to `_dataframe.py` to support the read-path already
  exist on another branch. That branch will be merged into this PR
  shortly
* Take care of formatting / typing
* Correct datetime domains
* Get full nonempty domains for `SOMADataFrame`
* Find missing open that needs to use `DataframeWrapper`
* Move `PyQueryCondition` Into `common.h`
* Use Pyarrow Schema instead of TileDB ArraySchema
* Remove TileDB-Py dependency
* No longer requires attr-to-enum mapping passed for dictionaries as
  this can be checked in Pyarrow Schema now
* Eventually the `arrow_schema` calls should replace `schema` but quite
  a few things still depend on the TileDB ArraySchema so this is going
  to be temporarily punted for now
@nguyenv nguyenv force-pushed the viviannguyen/ndarray-read-path branch from 98d069d to 9268d35 Compare December 5, 2023 18:46
@nguyenv
Copy link
Member Author

nguyenv commented Feb 13, 2024

Closing as these changes have now all been separated into more digestable PRs and are either already merged or ready to be reviewed.

#2124
#2126
#2129
#2132
#2133

@nguyenv nguyenv closed this Feb 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants