Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[c++/python] Expanded enumeration support in ArrowAdapter::to_arrow #1848

Merged
merged 5 commits into from
Nov 3, 2023

Conversation

nguyenv
Copy link
Member

@nguyenv nguyenv commented Nov 2, 2023

Issue and/or context:

Updates in Pandas 2.0 has introduced changes to DictionaryArray.from_arrays that required refactoring in pytiledbsoma.cc. This bug motivated a change in the C++ ArrowAdapter::to_arrow method to further handle all supported enumerated types.

Changes:

  • Several pre-existing C++ and Pybind11 modifications have pulled from [python] Dataframe read path #1793
  • Previously in the Python API the ArrowTable was converted from Table.from_arrays and then enumerations were added in an additional step with DictionaryArray.from_arrays. DictionaryArray.from_arrays has now been completely removed and now only uses Table.from_arrays
  • The required changes for the above to work required refactoring of the C++ ArrowAdapter::to_arrow method. Previous work has been done to support enumerated string values. The changes in this PR have expanded this now handle all supported enumerated types including all integral numeric types, floating point numbers, and Boolean
  • to_arrow_format now takes a use_large Boolean argument. For enumerations, we need to use string or binary rather than large_string or large_binary

Notes for Reviewer:

  • Pre-existing changes to the ColumnBuffer class to handle string enumerations, and the steps used in ArrowAdapter::to_arrow to populate the data and offet buffers remain untouched. However, I have written a TODO note on how this can be cleaned up in a future refactor
  • Boolean enumerations require special handling from other numeric types as std::vector<bool> is a specialized vector that does not store values contigously in memory. We must iterate through the container and then use bit masking to set each bit in least significant bit ordering
  • Awaiting TileDB-Py 0.23.4 release

@codecov-commenter
Copy link

codecov-commenter commented Nov 2, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

see 34 files with indirect coverage changes

📢 Thoughts on this report? Let us know!.

@nguyenv
Copy link
Member Author

nguyenv commented Nov 2, 2023

The unit test fail is corrected by TileDB-Inc/TileDB-Py#1853 which requires a TileDB-Py 0.23.4 release.

Linked against a local build of TileDB-Py with necessary enum changes as listed in PR description.

(tiledb-3.7) vivian@mangonada:~/TileDB-SOMA$ pip list | grep tiledb
tiledb             0.23.4.dev6                  /home/vivian/TileDB-Py
tiledbsoma         1.5.0rc0.post47.dev746138611
(tiledb-3.7) vivian@mangonada:~/TileDB-SOMA$ pytest apis/python/tests/test_dataframe.py::test_extend_enumerations
=============================================== test session starts ===============================================
platform linux -- Python 3.7.12, pytest-7.4.0, pluggy-1.2.0
rootdir: /home/vivian/TileDB-SOMA/apis/python
plugins: hypothesis-6.79.3, typeguard-2.13.3
collected 1 item

apis/python/tests/test_dataframe.py .                                                                       [100%]

================================================ 1 passed in 1.20s ================================================

Linked against 0.23.3.

(tiledb-3.7) vivian@mangonada:~/TileDB-SOMA$ pip list | grep tiledb
tiledb             0.23.3
tiledbsoma         1.5.0rc0.post47.dev746138611
(tiledb-3.7) vivian@mangonada:~/TileDB-SOMA$ pytest apis/python/tests/test_dataframe.py::test_extend_enumerations
=============================================== test session starts ===============================================
platform linux -- Python 3.7.12, pytest-7.4.0, pluggy-1.2.0
rootdir: /home/vivian/TileDB-SOMA/apis/python
plugins: hypothesis-6.79.3, typeguard-2.13.3
collected 1 item

apis/python/tests/test_dataframe.py F                                                                       [100%]

==================================================== FAILURES =====================================================
...

@nguyenv nguyenv changed the title [c++/python] Support enumerations with Pandas 2.0+ [c++/python] Expanded enumeration support in to_arrow Nov 3, 2023
@nguyenv nguyenv marked this pull request as ready for review November 3, 2023 16:14
@nguyenv nguyenv changed the title [c++/python] Expanded enumeration support in to_arrow [c++/python] Expanded enumeration support in ArrowAdapter::to_arrow Nov 3, 2023
@nguyenv nguyenv force-pushed the viviannguyen/correct-arrow-adapter-for-enums branch from 2c792bf to c12be53 Compare November 3, 2023 19:29
@nguyenv nguyenv changed the base branch from main to kerl/2.17.4 November 3, 2023 19:30
Base automatically changed from kerl/2.17.4 to main November 3, 2023 19:39
@nguyenv nguyenv force-pushed the viviannguyen/correct-arrow-adapter-for-enums branch from f2c6679 to 6afddc6 Compare November 3, 2023 19:50
Copy link
Contributor

@eddelbuettel eddelbuettel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Focusing on the libtiledbsoma/src changes) Looks good. I wonder if, now that value is defaulting to true whether we can remove the use_enum toggle ?

@nguyenv
Copy link
Member Author

nguyenv commented Nov 3, 2023

Yeah you're right; we can get rid of that now.

@eddelbuettel
Copy link
Contributor

@nguyenv Cool. Do you want to fold that edit into the PR?

@nguyenv
Copy link
Member Author

nguyenv commented Nov 3, 2023

Yup just made the change.

@johnkerl johnkerl merged commit e37f356 into main Nov 3, 2023
4 checks passed
@johnkerl johnkerl deleted the viviannguyen/correct-arrow-adapter-for-enums branch November 3, 2023 23:54
github-actions bot pushed a commit that referenced this pull request Nov 3, 2023
…#1848)

* [c++/python] Support enumerations with Pandas 2.0+

* Correct typing issues

* Add unit test

* Remove `use_enum` toggle

* Modify `to_arrow` in R API
johnkerl pushed a commit that referenced this pull request Nov 3, 2023
…#1848) (#1861)

* [c++/python] Support enumerations with Pandas 2.0+

* Correct typing issues

* Add unit test

* Remove `use_enum` toggle

* Modify `to_arrow` in R API

Co-authored-by: nguyenv <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants