[python] `NDArray` read path #1817

nguyenv · 2023-10-23T12:57:05Z

Issue and/or context:

To be merged on top of #1793.

Changes:

* When opening a `SparseNDArray` in read-mode, use `SparseNDArrayWrapper` which wraps around `clib.SOMASparseNDArray`
* When opening a `DenseNDArray` in read-mode, use `DenseNDArrayWrapper` which wraps around `clib.SOMADenseNDArray`
* Completely remove `SOMAArray`
* When the R API replaces `SOMAArray` with the new `SOMADataFrame`, `SOMASparseNDArray`, and `SOMADenseNDArray` classes, we can reorganize the C++ install headers so that they no longer install internal headers

codecov-commenter · 2023-10-23T13:08:17Z

Codecov Report

All modified and coverable lines are covered by tests ✅

see 110 files with indirect coverage changes

thetorpedodog

With the caveat that I have not yet extensively looked at the C++ side of things, here is a pass on the Python side. Sorry for missing this yesterday!

thetorpedodog · 2023-10-25T18:07:26Z

apis/python/src/tiledbsoma/_collection.py

            wrapper: type[Wrapper[Any | Any | Any]]
-            if self.mode == "r" and clib.SOMADataFrame.exists(entry.entry.uri):
+            if self.mode == "r" and clib.SOMADataFrame.exists(uri):


I probably should have noticed this earlier, but: we probably need to be passing the context we have to this function, right? It might contain user credentials or other important configuration (e.g. endpoint locations)

thetorpedodog · 2023-10-25T18:12:05Z

apis/python/src/tiledbsoma/_common_nd_array.py

@@ -105,7 +106,8 @@ def shape(self) -> Tuple[int, ...]:
        Lifecycle:
            Experimental.
        """
-        return cast(Tuple[int, ...], tuple(self._soma_reader().shape))
+        handle: SOMAArray = self._handle


This comment is only tangentially related to this line, so:

Given that _handle is now something different for all of the TileDB Array–based types, it’s probably worth it to pull that generic specialization out of TileDBArray and put it into each of the concrete ones (so this becomes a TileDBArray[SOMAArray], or however the wrapper type generic stuff works because I kind of forget, etc.).

This isn’t critical, so if it ends up being a mess I wouldn’t worry about it but if it is reasonably straightforward it’s worth considering.
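
A minimal sketch of what that per-class specialization might look like. The `FakeSOMAArray` and `FakeSOMADataFrame` classes are hypothetical stand-ins for the clib handle types, and the real `TileDBArray` and wrapper generics may well be parameterized differently:

```python
from typing import Generic, Tuple, TypeVar


class FakeSOMAArray:
    """Hypothetical stand-in for a clib array handle."""

    def __init__(self, shape: Tuple[int, ...]) -> None:
        self.shape = shape


class FakeSOMADataFrame:
    """Hypothetical stand-in for a clib dataframe handle."""

    def __init__(self, count: int) -> None:
        self.count = count


HandleT = TypeVar("HandleT")


class TileDBArray(Generic[HandleT]):
    """Base class that stays generic over the handle type."""

    def __init__(self, handle: HandleT) -> None:
        self._handle = handle


class NDArray(TileDBArray[FakeSOMAArray]):
    @property
    def shape(self) -> Tuple[int, ...]:
        # The checker already knows _handle is a FakeSOMAArray,
        # so no local annotation or cast is needed.
        return tuple(self._handle.shape)


class DataFrame(TileDBArray[FakeSOMADataFrame]):
    @property
    def count(self) -> int:
        return self._handle.count
```

With the type argument fixed on each concrete class, lines such as `handle: SOMAArray = self._handle` would no longer be needed to satisfy the type checker.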

thetorpedodog · 2023-10-25T18:14:55Z

apis/python/src/tiledbsoma/_dataframe.py

-
-    def column_to_enumeration(self, name: str) -> str:
-        return str(self._soma_reader().get_enum_label_on_attr(name))
+    def enumeration(self, name: str) -> Optional[Tuple[Any, ...]]:


To avoid use of Any, it may be better to call this Tuple[object, ...]. Doing so would have the advantage that the caller would need to either check or assert that it gets the returned type it wants, which is safer. But also it has the disadvantage that the caller would need to either check or assert that it gets the returned type it wants, which is kind of annoying.
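
To illustrate the trade-off, here is a small sketch. The two helper functions are hypothetical callers, not part of the PR; they only show how `Any` and `object` differ for whoever consumes the returned tuple:

```python
from typing import Any, Optional, Tuple


def first_label_any(values: Optional[Tuple[Any, ...]]) -> str:
    if values is None:
        raise ValueError("no enumeration")
    # With Any, a type checker accepts this call even when the
    # values are not strings; a mistake only shows up at runtime.
    return values[0].upper()


def first_label_object(values: Optional[Tuple[object, ...]]) -> str:
    if values is None:
        raise ValueError("no enumeration")
    first = values[0]
    # With object, a type checker rejects first.upper() until the
    # type has been narrowed, so the check below is required.
    if not isinstance(first, str):
        raise TypeError(f"expected str, got {type(first).__name__}")
    return first.upper()
```

Both versions behave the same for string values; they differ in whether the mismatch is caught by an explicit check or surfaces as an `AttributeError` deeper in the call.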

thetorpedodog · 2023-10-25T18:15:20Z

apis/python/src/tiledbsoma/_factory.py

    ):
        raise SOMAError(
            f"cannot open {hdl.uri!r}: a {type(hdl._handle)}"
            f" cannot be converted to a {typename}"
        )
+    print(typename, cls, type(hdl))


stray debug print

thetorpedodog · 2023-10-25T18:18:32Z

apis/python/src/tiledbsoma/_dataframe.py

+        handle: clib.DataFrameWrapper = self._handle
+        return cast(int, handle.count)


Could some of these be avoided by making a libtiledbsoma.pyi file? That would also give us some tooling help in editors. Up to you, and even if you do want to, it’s not something that needs to happen in this change.
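
For reference, a fragment of such a stub might look roughly like this. The class name and the single `count` property are assumptions based on the diff above, not the actual extension module's interface:

```python
# libtiledbsoma.pyi (hypothetical fragment)
class SOMADataFrame:
    @property
    def count(self) -> int: ...
```

With a declaration like this available to the type checker, `handle.count` would already be typed as `int` and the `cast(int, ...)` could be dropped.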

thetorpedodog · 2023-10-25T18:24:13Z

apis/python/src/tiledbsoma/_sparse_nd_array.py

+        to_clib_result_order = {
+            options.ResultOrder.AUTO: clib.ResultOrder.automatic,
+            options.ResultOrder.ROW_MAJOR: clib.ResultOrder.rowmajor,
+            options.ResultOrder.COLUMN_MAJOR: clib.ResultOrder.colmajor,
+            "auto": clib.ResultOrder.automatic,
+            "row-major": clib.ResultOrder.rowmajor,
+            "column-major": clib.ResultOrder.colmajor,
+        }
+        if result_order not in to_clib_result_order:
+            raise ValueError(f"Invalid result_order: {result_order}")


To go with the above, it looks like this could be pulled into a function.
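
A sketch of such a helper, under the assumption that the mapping stays as it is in the diff. `ResultOrder` and `ClibResultOrder` below are stand-ins for `options.ResultOrder` and `clib.ResultOrder`, and the function name is made up for illustration:

```python
import enum
from typing import Dict, Union


class ResultOrder(enum.Enum):
    """Stand-in for options.ResultOrder (member values are assumed)."""

    AUTO = "auto"
    ROW_MAJOR = "row-major"
    COLUMN_MAJOR = "column-major"


class ClibResultOrder(enum.Enum):
    """Stand-in for clib.ResultOrder."""

    automatic = 0
    rowmajor = 1
    colmajor = 2


# Same six entries as the dictionary in the diff, built once at import time.
_TO_CLIB_RESULT_ORDER: Dict[Union[ResultOrder, str], ClibResultOrder] = {
    ResultOrder.AUTO: ClibResultOrder.automatic,
    ResultOrder.ROW_MAJOR: ClibResultOrder.rowmajor,
    ResultOrder.COLUMN_MAJOR: ClibResultOrder.colmajor,
    "auto": ClibResultOrder.automatic,
    "row-major": ClibResultOrder.rowmajor,
    "column-major": ClibResultOrder.colmajor,
}


def to_clib_result_order(result_order: Union[ResultOrder, str]) -> ClibResultOrder:
    """Translate a user-facing result order into the clib enum."""
    try:
        return _TO_CLIB_RESULT_ORDER[result_order]
    except KeyError:
        raise ValueError(f"Invalid result_order: {result_order}") from None
```

Each read method could then call `to_clib_result_order(result_order)` instead of rebuilding the dictionary and repeating the membership check.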

thetorpedodog · 2023-10-25T18:25:23Z

apis/python/src/tiledbsoma/_tdb_handles.py

+            {k: str(v) for k, v in context.tiledb_config.items()},
+            [],
+            clib.ResultOrder.automatic,
+            (0, timestamp),


Could you add names to these arguments? The floating [] makes me nervous because I have no idea what it belongs to.
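
As an illustration of the difference named arguments make, here is a sketch against a stand-in function. `open_array` and all of its parameter names are hypothetical, including `column_names` as a guess for what the bare `[]` is; the real clib signature may use different names:

```python
from typing import Dict, List, Tuple


def open_array(
    uri: str,
    mode: str,
    platform_config: Dict[str, str],
    column_names: List[str],
    result_order: str,
    timestamp: Tuple[int, int],
) -> Dict[str, object]:
    """Hypothetical stand-in for a clib open() call; it just echoes its arguments."""
    return {
        "uri": uri,
        "mode": mode,
        "platform_config": platform_config,
        "column_names": column_names,
        "result_order": result_order,
        "timestamp": timestamp,
    }


handle = open_array(
    uri="file:///tmp/example",
    mode="r",
    platform_config={"sm.example_key": "value"},  # placeholder config entry
    column_names=[],  # the formerly anonymous [] now says what it is
    result_order="auto",
    timestamp=(0, 1_700_000_000_000),
)
```

For a pybind11 binding, calling by keyword only works if the binding declares argument names, so this may require a change on the C++ side as well.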

thetorpedodog · 2023-10-25T18:26:51Z

apis/python/src/tiledbsoma/_tdb_handles.py

-            (0, timestamp),
-        )
+class SOMAArrayWrapper(Wrapper[SOMAArray]):
+    """Wrapper for Array-derived SOMAObject classes."""


If you add ARRAY_IMPL: Type[SOMAArray] here…

thetorpedodog · 2023-10-25T18:28:05Z

apis/python/src/tiledbsoma/_tdb_handles.py

+    @classmethod
+    def _opener(
+        cls,
+        uri: str,
+        mode: options.OpenMode,
+        context: SOMATileDBContext,
+        timestamp: int,
+    ) -> clib.SOMADataFrame:
+        open_mode = clib.OpenMode.read if mode == "r" else clib.OpenMode.write
+        return clib.SOMADataFrame.open(
+            uri,
+            open_mode,
+            {k: str(v) for k, v in context.tiledb_config.items()},
+            [],
+            clib.ResultOrder.automatic,
+            (0, timestamp),
+        )


…and move this to SOMAArrayWrapper, replacing clib.SOMADataFrame with cls.ARRAY_IMPL, and set ARRAY_IMPL = SOMADataFrame here…

thetorpedodog · 2023-10-25T18:28:43Z

apis/python/src/tiledbsoma/_tdb_handles.py

+    @classmethod
+    def _opener(
+        cls,
+        uri: str,
+        mode: options.OpenMode,
+        context: SOMATileDBContext,
+        timestamp: int,
+    ) -> clib.SOMASparseNDArray:
+        open_mode = clib.OpenMode.read if mode == "r" else clib.OpenMode.write
+        return clib.SOMASparseNDArray.open(
+            uri,
+            open_mode,
+            {k: str(v) for k, v in context.tiledb_config.items()},
+            [],
+            clib.ResultOrder.automatic,
+            (0, timestamp),
+        )


…then we can eliminate all this duplicated code (by similarly setting an ARRAY_IMPL here and below).
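
Put together, the three comments above suggest something like the following sketch. The `Fake*` classes are hypothetical stand-ins for the clib classes, the context is reduced to a plain config dictionary, and the argument order of `open` simply mirrors the diff:

```python
from typing import Any, ClassVar, Dict, List, Tuple, Type


class FakeClibArray:
    """Hypothetical stand-in for the clib array classes; it only records its arguments."""

    def __init__(self, uri: str, mode: str, timestamp: Tuple[int, int]) -> None:
        self.uri = uri
        self.mode = mode
        self.timestamp = timestamp

    @classmethod
    def open(
        cls,
        uri: str,
        mode: str,
        config: Dict[str, str],
        column_names: List[str],
        result_order: str,
        timestamp: Tuple[int, int],
    ) -> "FakeClibArray":
        return cls(uri, mode, timestamp)


class FakeSOMADataFrame(FakeClibArray):
    pass


class FakeSOMASparseNDArray(FakeClibArray):
    pass


class SOMAArrayWrapper:
    """Shared opener; subclasses only choose which clib class to open."""

    ARRAY_IMPL: ClassVar[Type[FakeClibArray]]

    @classmethod
    def _opener(
        cls, uri: str, mode: str, config: Dict[str, Any], timestamp: int
    ) -> FakeClibArray:
        open_mode = "read" if mode == "r" else "write"
        return cls.ARRAY_IMPL.open(
            uri,
            open_mode,
            {k: str(v) for k, v in config.items()},
            [],
            "automatic",
            (0, timestamp),
        )


class DataFrameWrapper(SOMAArrayWrapper):
    ARRAY_IMPL = FakeSOMADataFrame


class SparseNDArrayWrapper(SOMAArrayWrapper):
    ARRAY_IMPL = FakeSOMASparseNDArray
```

Each concrete wrapper shrinks to a single class attribute, and a dense wrapper could be added the same way.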

beroy · 2023-11-14

@nguyenv the PR also does a major refactoring of the pybind11 files, which is not mentioned in the description! Would that stay as a part of this PR?

nguyenv · 2023-11-14

#1793

It is in the description for the PR above, which takes precedence over this PR. I should mark this PR as draft for now, as I'm dealing with segfault issues arising from the new blockchain iterator.

nguyenv · 2023-11-14

I will definitely help you with reorganizing the reindexer code if that is your main concern with the pybind11 refactoring.

* When opening a `DataFrame` in read-mode, use `DataFrameWrapper` which wraps around `clib.SOMADataFrame`. Otherwise, `DataFrame` should use the already existing write-path with `ArrayWrapper` which wraps around a TileDB-Py Array
* Necessary changes to `_dataframe.py` to support the read-path already exist on another branch. That branch will be merged into this PR shortly

nguyenv · 2024-02-13T18:03:33Z

Closing as these changes have now all been separated into more digestible PRs and are either already merged or ready to be reviewed.

#2124
#2126
#2129
#2132
#2133

nguyenv marked this pull request as ready for review October 23, 2023 18:29

nguyenv requested review from ihnorton, eddelbuettel, johnkerl, thetorpedodog, gspowley and beroy October 23, 2023 18:32

nguyenv marked this pull request as draft October 23, 2023 18:42

nguyenv marked this pull request as ready for review October 23, 2023 21:24

thetorpedodog reviewed Oct 25, 2023

nguyenv force-pushed the viviannguyen/python-read-path branch 4 times, most recently from d5f4dc5 to 8f7951e Compare November 9, 2023 22:29

nguyenv marked this pull request as draft November 14, 2023 22:31

nguyenv force-pushed the viviannguyen/ndarray-read-path branch from a6dc282 to 98d069d Compare December 5, 2023 17:47

nguyenv added 13 commits December 5, 2023 12:37

Add Methods to DataFrameWrapper and ArrayWrapper

45d294b

* Take care of formatting / typing
* Correct datetime domains
* Get full nonempty domains for `SOMADataFrame`
* Find missing open that needs to use `DataframeWrapper`

Free Name in ArrowSchema

c6ef885

Add Dictionary Support For ArraySchema -> ArrowSchema

7923bbd

Refactor QueryCondition

6747e38

* Move `PyQueryCondition` into `common.h`
* Use Pyarrow Schema instead of TileDB ArraySchema
* Remove TileDB-Py dependency
* No longer requires attr-to-enum mapping passed for dictionaries as this can be checked in Pyarrow Schema now

Formatting

c85fe9b

SOMADataFrame --> DataframeWrapper

d4c185e

Add Pybind11 Common Code

dd7b65b

Reduce Usage of TileDB ArraySchema in Favor in Pyarrow Schema

31db924

Support All Enumeration Value Types in to_arrow

4b3ad74

Do TileDB ArraySchema to ArrowSchema Converstion in C++

a244b4d

* Eventually the `arrow_schema` calls should replace `schema` but quite a few things still depend on the TileDB ArraySchema so this is going to be temporarily punted for now

Correct exists Method For macos

4236ec5

"Correct" typing

bfa4384

nguyenv added 14 commits December 5, 2023 12:38

Fix major bug where dense array was actually opening as sparse array

79024f2

WIP post-rebase fixes

4a5310c

WIP

607172c

WIP

a5b3d12

WIP

3f6d023

WIP

8254985

WIP no segfault

66c1ed1

WIP

a5ecd42

WIP

cdfe2bf

WIP

12605bb

Use future to wait for query

14f189d

Separate submit_read and results

f5645a8

Fix segfault

b3902f3

Correct typing

9268d35

nguyenv force-pushed the viviannguyen/ndarray-read-path branch from 98d069d to 9268d35 Compare December 5, 2023 18:46

Modify fs path string

9c1aa9c

nguyenv mentioned this pull request Dec 5, 2023

[c++] Modify ManagedQuery to perform async queries #1953

Merged

nguyenv force-pushed the viviannguyen/python-read-path branch from 8f7951e to 0934313 Compare December 16, 2023 17:51

nguyenv force-pushed the viviannguyen/python-read-path branch from 21a16a0 to 6a12ef2 Compare January 11, 2024 16:51

nguyenv force-pushed the viviannguyen/python-read-path branch 3 times, most recently from 03e7a61 to 53cca7e Compare January 19, 2024 19:47

Base automatically changed from viviannguyen/python-read-path to main January 19, 2024 21:14

nguyenv closed this Feb 13, 2024
