Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[python] Dataframe read path #1793

Merged
merged 24 commits into from
Jan 19, 2024
Merged

[python] Dataframe read path #1793

merged 24 commits into from
Jan 19, 2024

Conversation

nguyenv
Copy link
Member

@nguyenv nguyenv commented Oct 16, 2023

Issue and/or context:

Running Changes:

  • When opening a DataFrame in read-mode, use DataFrameWrapper which wraps around clib.SOMADataFrame. Otherwise, DataFrame should use the already existing write-path with ArrayWrapper which wraps around a TileDB-Py Array
  • Corrections to C++ unit tests where SOMADataFrame needs to use sparse array
  • SOMADataFrame::count needs to use nnz
  • C++ API create needs to set soma_object_type and exists needs to check soma_object_type
  • Get full nonempty domains for SOMADataFrame
  • Add getters and methods to DataFrameWrapper and ArrayWrapper (i.e nonempty domain, attr names, dim names, etc)
  • Support query conditions for SOMADataFrame
  • _query_condition.py no longer has dependency on TileDB-Py and uses Pyarrow schema instead of TileDB ArraySchema and SOMAError instead of TileDBError
  • Reorganize Pybind11 code to place util functions and PyQueryCondition (note: should refactor QueryCondition to completely remove PyQueryCondition usage) into common.h
  • SOMAArray now resides in its own soma_array.cc file
  • pytiledbsoma.cc is now the top-level file that contains stats bindings and loads modules from query_condition.cc, soma_array.cc, soma_dataframe.cc
  • Support enumerations / dictionaries for SOMADataFrame
  • Support all enumeration values in ArrowAdapter::to_arrow
  • read Pybind11 function now uses correct Arrow schema to convert dictionary arrays when importing from C to Python
  • SOMAObject::schema returns ArrowSchema which calls from SOMAArray::arrow_schema
  • Create C++ factory unique_ptr<SOMAObject> SOMAObject::open and bind in Pybind11 as clib.SOMAObject.open which returns the correct Python SOMA class
  • Fix for [python] Errors raised as RuntimeError should raise SOMAError #783

@codecov-commenter
Copy link

codecov-commenter commented Oct 16, 2023

Codecov Report

Merging #1793 (53cca7e) into main (08931d3) will decrease coverage by 1.55%.
The diff coverage is 66.01%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1793      +/-   ##
==========================================
- Coverage   77.59%   76.05%   -1.55%     
==========================================
  Files          87      136      +49     
  Lines        7784    10552    +2768     
  Branches        0      206     +206     
==========================================
+ Hits         6040     8025    +1985     
- Misses       1744     2433     +689     
- Partials        0       94      +94     
Flag Coverage Δ
libtiledbsoma 67.69% <26.31%> (?)
python 91.40% <84.36%> (+2.31%) ⬆️
r 67.72% <100.00%> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
python_api 91.40% <84.36%> (+2.31%) ⬆️
libtiledbsoma 50.61% <14.28%> (∅)

@nguyenv nguyenv force-pushed the viviannguyen/python-read-path branch from 7ee0b85 to 8992661 Compare October 17, 2023 15:44
@johnkerl johnkerl changed the title [python] Read Path [python] Read path Oct 18, 2023
@nguyenv nguyenv force-pushed the viviannguyen/python-read-path branch from c0d02b0 to 4e63bc4 Compare October 19, 2023 00:36
@nguyenv nguyenv force-pushed the viviannguyen/python-read-path branch from 7db68e1 to 2c8fa29 Compare October 20, 2023 21:32
@nguyenv nguyenv changed the title [python] Read path [python] Dataframe read path Oct 20, 2023
@nguyenv nguyenv marked this pull request as ready for review October 20, 2023 23:16
apis/python/src/tiledbsoma/_collection.py Outdated Show resolved Hide resolved
apis/python/src/tiledbsoma/_dataframe.py Outdated Show resolved Hide resolved
apis/python/src/tiledbsoma/_dataframe.py Outdated Show resolved Hide resolved
apis/python/src/tiledbsoma/_dataframe.py Outdated Show resolved Hide resolved
apis/python/src/tiledbsoma/_tdb_handles.py Outdated Show resolved Hide resolved
apis/python/src/tiledbsoma/_tdb_handles.py Outdated Show resolved Hide resolved
apis/python/src/tiledbsoma/_tdb_handles.py Outdated Show resolved Hide resolved
@nguyenv nguyenv force-pushed the viviannguyen/python-read-path branch 5 times, most recently from 4477714 to 4306121 Compare November 7, 2023 16:26
Copy link
Contributor

@thetorpedodog thetorpedodog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(the usual disclaimer that I’m mostly looking at the Python side for style etc. and didn’t look at the C++ code in detail)

apis/python/src/tiledbsoma/_dataframe.py Outdated Show resolved Hide resolved
apis/python/src/tiledbsoma/_factory.py Outdated Show resolved Hide resolved
apis/python/src/tiledbsoma/_sparse_nd_array.py Outdated Show resolved Hide resolved
apis/python/src/tiledbsoma/_tdb_handles.py Outdated Show resolved Hide resolved
apis/python/src/tiledbsoma/_tdb_handles.py Outdated Show resolved Hide resolved
apis/python/src/tiledbsoma/_tdb_handles.py Outdated Show resolved Hide resolved
apis/python/src/tiledbsoma/_tdb_handles.py Outdated Show resolved Hide resolved
apis/python/src/tiledbsoma/_tdb_handles.py Show resolved Hide resolved
apis/python/src/tiledbsoma/_tdb_handles.py Outdated Show resolved Hide resolved
apis/python/src/tiledbsoma/_util.py Outdated Show resolved Hide resolved
@thetorpedodog
Copy link
Contributor

(the usual disclaimer that I’m mostly looking at the Python side for style etc. and didn’t look at the C++ code in detail)

I should add that I like the way the open-and-select-the-right-wrapper thing works now. It’s very well structured and understandable.

@nguyenv nguyenv requested a review from thetorpedodog November 9, 2023 01:45
@nguyenv nguyenv force-pushed the viviannguyen/python-read-path branch 2 times, most recently from d5f4dc5 to 8f7951e Compare November 9, 2023 22:29
Copy link
Member

@johnkerl johnkerl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing huge to sugggest -- some spelling changes etc

libtiledbsoma/test/unit_soma_dense_ndarray.cc Show resolved Hide resolved
libtiledbsoma/test/unit_soma_dataframe.cc Show resolved Hide resolved
libtiledbsoma/test/unit_soma_array.cc Show resolved Hide resolved
libtiledbsoma/src/soma/soma_sparse_ndarray.h Outdated Show resolved Hide resolved
apis/python/src/tiledbsoma/soma_array.cc Show resolved Hide resolved
apis/python/tests/test_platform_config.py Show resolved Hide resolved
libtiledbsoma/src/soma/soma_array.h Outdated Show resolved Hide resolved
libtiledbsoma/src/soma/soma_array.h Outdated Show resolved Hide resolved
libtiledbsoma/src/soma/soma_dataframe.cc Show resolved Hide resolved
* Move `PyQueryCondition` Into `common.h`
* Use Pyarrow Schema instead of TileDB ArraySchema
* Remove TileDB-Py dependency
* No longer requires attr-to-enum mapping passed for dictionaries as
  this can be checked in Pyarrow Schema now
* Eventually the `arrow_schema` calls should replace `schema` but quite
  a few things still depend on the TileDB ArraySchema so this is going
  to be temporarily punted for now
* The R API should be refactored soon to not rely on tiledb::ArraySchema
@nguyenv nguyenv force-pushed the viviannguyen/python-read-path branch from 7c4b511 to 03e7a61 Compare January 19, 2024 19:17
@nguyenv nguyenv force-pushed the viviannguyen/python-read-path branch from 03e7a61 to 53cca7e Compare January 19, 2024 19:47
@nguyenv
Copy link
Member Author

nguyenv commented Jan 19, 2024

@beroy I have now merged your changes into my PR. The only change I did was move your reindexer unit tests to reside within the apis/python/tests directory. This is a part of changes within this PR where I've moved all Python unit tests from libtilesoma/tests into the Python API directory as to not have them split between two directories. Because this PR is already large enough as is, I have decided to leave the reindexer bindings within pytiledbsoma.cc for now but will be opening a subsequent PR where this gets separated out into its own reindexer.cc.

@nguyenv nguyenv merged commit 44bfad4 into main Jan 19, 2024
15 checks passed
@nguyenv nguyenv deleted the viviannguyen/python-read-path branch January 19, 2024 21:14
jdblischak added a commit to TileDB-Inc/centralized-tiledb-nightlies that referenced this pull request Jan 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants