[c++] Support `Enumeration` in C++ codebase #1519
std::shared_ptr<ColumnBuffer> ColumnBuffer::create(
    ArraySchema schema, std::string_view name) {
    auto schema = array->schema();
    auto name_str = std::string(name);  // string for TileDB API

    if (schema.has_attribute(name_str)) {
I so far have also limited myself to attributes but I think the core implementation is more powerful and we could have enumerations as dimensions (!!) given that we already have both int and char dims anyway.
let's double-check with @davisp -- ?
Big nopes to enumeration support for dimensions. The primitives are close, but there are a whole bunch of non-obvious behavior issues we’d run into due to dimensions requiring a defined sort and a bunch of other things I don’t know enough about to even begin listing.
Which is to say that it's probably not technically infeasible, but the consensus at the design stage was to call dimension support out of scope for the MVP/V1 implementation.
Thank you so much for clarifying. I have been running so far with 'attributes only' too but must have hallucinated myself into thinking you had mused about dims too. I probably poorly inferred from query conditions. All good then.
No worries! This is absolutely one of those situations where, given the primitives, it seems like adding the functionality would be straightforward, but then when we start working through the edge cases there are a whole bunch of open questions on what the behavior should be with no clear answers.
The current implementation ignores dimensions purely out of expediency, to get something usable first. That both makes sure it's usable as designed and should also help shake out anything that was overlooked. I’d wager a small amount of money that enumerations for dimensions will exist eventually; it's just a matter of making sure our behaviors are all intentional for long-term support.
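To make the "defined sort" concern concrete, here is a small illustration of one possible ambiguity; it is not necessarily the issue the maintainers had in mind. With an enumeration, cells could be ordered by their integer codes or by their labels, and the two orders need not agree. The helpers `sort_by_code` and `sort_by_label` are hypothetical names for this sketch and are not part of TileDB or libtiledbsoma; codes are assumed to be valid indexes into the label vector.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: order cells by their stored integer codes.
std::vector<std::int32_t> sort_by_code(std::vector<std::int32_t> codes) {
    std::sort(codes.begin(), codes.end());
    return codes;
}

// Hypothetical helper: order cells by the label each code refers to.
// Assumes every code is a valid index into `labels`.
std::vector<std::int32_t> sort_by_label(
    std::vector<std::int32_t> codes, const std::vector<std::string>& labels) {
    std::stable_sort(
        codes.begin(), codes.end(),
        [&labels](std::int32_t a, std::int32_t b) {
            return labels[static_cast<std::size_t>(a)] <
                   labels[static_cast<std::size_t>(b)];
        });
    return codes;
}
```

With labels `{"low", "mid", "high"}` and codes `{2, 0, 1}`, sorting by code yields low, mid, high, while sorting by label yields high, low, mid. A dimension would have to commit to one of these.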
Man you make me look good because I did use the lawyer-advised (kidding here) conditional above:
and we could have enumerations as dimensions
Thanks for starting this. I may try to give it a whirl building with it next.
44a0f39 to 416d557
416d557 to 1d2a05f
1d2a05f to 97ab0a6
* Addition of `SOMAArray::get_enum` and `SOMAArray::get_enum_label_on_attr`
* Attach an enumeration/dictionary to the `ColumnBuffer` if applicable; this is used when converting from `ArrayBuffers` to Arrow Tables in the Python and R APIs
* Prior to TileDB-Inc/TileDB#4272, the SOMA unit tests were erroneously writing a byte vector for string dimensions, which maps to `TILEDB_BLOB` rather than `TILEDB_STRING_ASCII`
97ab0a6 to 3d01f83
Codecov Report: Patch has no changes to coverable lines.
Should we also merge the C++ changes from #1674 into this branch?
@nguyenv sure? @eddelbuettel what do you think?
Yeah I am anxious about all the merge conflicts that are about to happen and wondering what the best way to go about this is.
@nguyenv @eddelbuettel I take that back -- preferring to keep this C++ PR foundational, merge it without 2.17.0-rc0 dependencies, and go from there.
One at a time. This PR is foundational. #1511 next.
* **Issue and/or context:** As described in #1558 and #866, adding enumeration support is desirable once we have TileDB Embedded 2.17 available.
  **Changes:** This PR supports reading of columns with enumerations (aka dictionaries, aka factor variables) directly via Arrow. Preliminary write support is also available (but still goes through the `tiledb` R package for writes).
  **Notes for Reviewer:** ~This PR is now work-in-progress and not ready for a merge while we await TileDB 2.17.~ The branch and PR are ready but should only be merged once the prerequisites have been merged. It likely needs #1519 (C++ side) and #1663 (CI support). CI is turned off as the TileDB default build is still without support for enumerations.
* **Issue and/or context:** This PR adds support for returning Arrow tables with dictionaries that can include `ordered` enumerations.
  **Changes:** Given #1559, which it depends upon, a very small change to just three files in `libtiledbsoma`. This should become clearer once the dependent PR is merged and this one can be rebased.
  **Notes for Reviewer:** [SC 34073](https://app.shortcut.com/tiledb-inc/story/34073/c-add-ordered-support-to-arrow-export)
* **Issue and/or context:** This PR extends the `schema()` function to return an Arrow schema with enumerations including `ordered`.
  **Changes:** Given #1559, which it depends upon, a very small change to just one file. This should become clearer once the dependent PR is merged and this one can be rebased.
  **Notes for Reviewer:** [SC 34074](https://app.shortcut.com/tiledb-inc/story/34074/c-add-ordered-support-to-arrow-export)
* [c++] Test fixes for #1559 (#1684)
* ihn/bugfix
* unit-test update
* lint

Co-authored-by: John Kerl
**Issue and/or context:** #866

As per discussion with @eddelbuettel, separating the C++ code out from #1511 so that it can be utilized in both the Python and R APIs.

**Changes:**

* Addition of `SOMAArray::get_attr_to_enum_mapping`, `SOMAArray::get_enum` and `SOMAArray::get_enum_label_on_attr`
* Attach an enumeration/dictionary to the `ColumnBuffer`, if applicable; this is used when converting from `ArrayBuffers` to Arrow Tables in the Python and R APIs
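For readers less familiar with the dictionary layout, the following is a minimal sketch of the idea behind attaching an enumeration to a column: the column stores integer codes, and the optional dictionary maps each code to its label. `EnumColumn` and `decode` are hypothetical names made up for this illustration; they are not the `ColumnBuffer` or `SOMAArray` interfaces from this PR, and the real conversion hands the codes and dictionary to Arrow rather than materializing strings.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a column that may carry an enumeration.
struct EnumColumn {
    std::vector<std::int32_t> codes;                 // one index per cell
    std::optional<std::vector<std::string>> labels;  // dictionary, if attached
    bool ordered = false;                            // whether label order is meaningful
};

// Expand the integer codes into their labels.
// Throws if no dictionary is attached or a code falls outside it.
std::vector<std::string> decode(const EnumColumn& col) {
    if (!col.labels) {
        throw std::logic_error("column has no enumeration attached");
    }
    std::vector<std::string> out;
    out.reserve(col.codes.size());
    for (std::int32_t code : col.codes) {
        if (code < 0 ||
            static_cast<std::size_t>(code) >= col.labels->size()) {
            throw std::out_of_range("code outside dictionary");
        }
        out.push_back((*col.labels)[static_cast<std::size_t>(code)]);
    }
    return out;
}
```

A column with codes `{2, 0, 0, 1}` and labels `{"low", "mid", "high"}` decodes to high, low, low, mid. Keeping codes and labels separate is what lets the Python and R layers expose the column as an Arrow dictionary (or an R factor) without copying the strings per cell.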