[c++/python] Expanded enumeration support in ArrowAdapter::to_arrow
#1848
Codecov Report: All modified and coverable lines are covered by tests. See 34 files with indirect coverage changes.
The unit test failure is corrected by TileDB-Inc/TileDB-Py#1853, which requires a TileDB-Py 0.23.4 release. Linked against a local build of TileDB-Py with the necessary enum changes, as listed in the PR description.
Linked against 0.23.3.
(Focusing on the libtiledbsoma/src changes) Looks good. I wonder whether, now that the value defaults to true, we can remove the `use_enum` toggle?
Yeah you're right; we can get rid of that now.
@nguyenv Cool. Do you want to fold that edit into the PR?
Yup, just made the change.
…#1848)
* [c++/python] Support enumerations with Pandas 2.0+
* Correct typing issues
* Add unit test
* Remove `use_enum` toggle
* Modify `to_arrow` in R API
…#1848) (#1861)
* [c++/python] Support enumerations with Pandas 2.0+
* Correct typing issues
* Add unit test
* Remove `use_enum` toggle
* Modify `to_arrow` in R API
Co-authored-by: nguyenv
Issue and/or context:
Updates in Pandas 2.0 introduced changes to `DictionaryArray.from_arrays` that required refactoring in `pytiledbsoma.cc`. This bug motivated a change in the C++ `ArrowAdapter::to_arrow` method to handle all supported enumerated types.
Changes:
- Previously, `ArrowTable` was converted from `Table.from_arrays`, and enumerations were then added in an additional step with `DictionaryArray.from_arrays`. `DictionaryArray.from_arrays` has now been completely removed, and the conversion uses only `Table.from_arrays`.
- `ArrowAdapter::to_arrow` method: previous work added support for enumerated string values. The changes in this PR expand this to handle all supported enumerated types, including all integral numeric types, floating-point numbers, and Booleans.
- `to_arrow_format` now takes a `use_large` Boolean argument. For enumerations, we need to use `string` or `binary` rather than `large_string` or `large_binary`.
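As a rough illustration of the `use_large` switch described above, the sketch below maps a variable-length type to an Arrow C Data Interface format string (`u`/`z` for 32-bit offsets, `U`/`Z` for 64-bit offsets). The `VarLenType` enum and the function name `to_arrow_format_sketch` are hypothetical stand-ins, not the actual `to_arrow_format` signature in this PR.

```cpp
#include <string_view>

// Hypothetical stand-in for the variable-length types relevant here;
// the real code dispatches on TileDB datatypes instead.
enum class VarLenType { String, Binary };

// Sketch only: choose the Arrow C Data Interface format string.
//   "u" = utf8 string (32-bit offsets),   "U" = large utf8 string (64-bit offsets)
//   "z" = binary (32-bit offsets),        "Z" = large binary (64-bit offsets)
constexpr std::string_view to_arrow_format_sketch(VarLenType type, bool use_large) {
    switch (type) {
        case VarLenType::String:
            return use_large ? "U" : "u";
        case VarLenType::Binary:
            return use_large ? "Z" : "z";
    }
    return "";  // unreachable for valid enum values
}
```

Under this sketch, an enumeration's dictionary values would be formatted with `use_large` set to `false`, while ordinary variable-length columns would pass `true`.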
Notes for Reviewer:
- The `ColumnBuffer` class code that handles string enumerations, and the steps used in `ArrowAdapter::to_arrow` to populate the data and offset buffers, remain untouched. However, I have written a TODO note on how this can be cleaned up in a future refactor.
- `std::vector<bool>` is a specialized vector that does not store its values as a contiguous array of `bool` in memory. We must iterate through the container and use bit masking to set each bit in least-significant-bit ordering.
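The bit-masking step in the last note can be sketched as follows. This is a minimal illustration, not the code from this PR; the function name `pack_bools_lsb` is hypothetical. It assumes the target is an Arrow-style bitmap in which value `i` lives in bit `i % 8` of byte `i / 8`.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch only: pack a std::vector<bool> into a byte buffer using
// least-significant-bit ordering (value i -> byte i / 8, bit i % 8).
std::vector<std::uint8_t> pack_bools_lsb(const std::vector<bool>& values) {
    // Round up to a whole number of bytes; unused trailing bits stay zero.
    std::vector<std::uint8_t> packed((values.size() + 7) / 8, 0);
    for (std::size_t i = 0; i < values.size(); ++i) {
        if (values[i]) {
            // Set bit (i % 8) of the byte that holds value i.
            packed[i / 8] |= static_cast<std::uint8_t>(1u << (i % 8));
        }
    }
    return packed;
}
```

Element-wise iteration is used because `std::vector<bool>` offers no `data()` pointer to a `bool` array, so its contents cannot simply be copied with `memcpy`.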