[c++] Support Enumeration in C++ codebase #1519

Merged
johnkerl merged 13 commits into main from viviannguyen/enumerated-dtypes-in-cpp
Sep 14, 2023

@nguyenv nguyenv commented Jul 5, 2023

Issue and/or context:

#866

As per discussion with @eddelbuettel, separating the C++ code out from #1511 so that it can be utilized in both the Python and R APIs.

Changes:

  • Addition of SOMAArray::get_attr_to_enum_mapping, SOMAArray::get_enum and SOMAArray::get_enum_label_on_attr
  • Attach an enumeration/dictionary to the ColumnBuffer, if applicable; this is used when converting from ArrayBuffers to Arrow Tables in the Python and R APIs
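To make the second bullet concrete, here is a minimal sketch of what "attaching an enumeration to a column" can mean when building a dictionary-encoded table: the column keeps integer codes per cell, plus an optional list of labels that the codes index into. `SketchColumn` and `decode_labels` are hypothetical names for illustration only; they are not the `ColumnBuffer` or `SOMAArray` API added by this PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a column buffer; NOT the real ColumnBuffer class.
struct SketchColumn {
    std::vector<int32_t> codes;  // one integer code per cell
    std::optional<std::vector<std::string>> enum_labels;  // dictionary, if any
};

// Resolve every code to its label.
// Throws if no dictionary is attached or a code falls outside it.
std::vector<std::string> decode_labels(const SketchColumn& col) {
    if (!col.enum_labels) {
        throw std::invalid_argument("column has no enumeration attached");
    }
    const auto& labels = *col.enum_labels;
    std::vector<std::string> out;
    out.reserve(col.codes.size());
    for (int32_t code : col.codes) {
        if (code < 0 || static_cast<std::size_t>(code) >= labels.size()) {
            throw std::out_of_range("code outside enumeration");
        }
        out.push_back(labels[static_cast<std::size_t>(code)]);
    }
    return out;
}
```

A dictionary-encoded Arrow column would normally keep the codes and labels separate rather than decoding them; the decode step here only shows how the two pieces relate.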

@nguyenv nguyenv requested a review from eddelbuettel July 5, 2023 21:38

std::shared_ptr<ColumnBuffer> ColumnBuffer::create(
    ArraySchema schema, std::string_view name) {
    // The schema is now passed in directly; the earlier
    // `auto schema = array->schema();` would redeclare the parameter.
    auto name_str = std::string(name);  // string for TileDB API

    if (schema.has_attribute(name_str)) {
Contributor

I so far have also limited myself to attributes but I think the core implementation is more powerful and we could have enumerations as dimensions (!!) given that we already have both int and char dims anyway.

Member

let's double-check with @davisp -- ?

Contributor

Big nopes to enumeration support for dimensions. The primitives are close, but there are a whole bunch of non-obvious behavior issues we’d run into due to dimensions requiring a defined sort and a bunch of other things I don’t know enough about to even begin listing.

Which is to say that it's probably not technically infeasible, but the consensus at the design stage was to call dimension support out of scope for the MVP/V1 implementation.
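One way to picture the "defined sort" concern, as an illustration only and not a statement of the actual TileDB design discussion: an enumerated value has two natural orderings, by its integer code and by its label, and they need not agree. The helper names below are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Order cells by the stored integer code.
std::vector<int32_t> sort_by_code(std::vector<int32_t> codes) {
    std::sort(codes.begin(), codes.end());
    return codes;
}

// Order cells by the label each code refers to.
// Assumes every code is a valid index into `labels`.
std::vector<int32_t> sort_by_label(
    std::vector<int32_t> codes, const std::vector<std::string>& labels) {
    std::sort(codes.begin(), codes.end(), [&](int32_t a, int32_t b) {
        return labels[static_cast<std::size_t>(a)] <
               labels[static_cast<std::size_t>(b)];
    });
    return codes;
}
```

With labels `{"medium", "low", "high"}`, sorting codes `{0, 1, 2}` by code leaves them unchanged, while sorting by label reverses them. A dimension would have to commit to one of these orders.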

Contributor

Thank you so much for clarifying. I have been running so far with 'attributes only' too but must have hallucinated myself into thinking you had mused about dims too. I probably poorly inferred from query conditions. All good then.

Contributor

No worries! This is absolutely one of those situations where, given the primitives, it seems like adding the functionality would be straightforward, but then when we start working through the edge cases there are a whole bunch of open questions on what the behavior should be with no clear answers.

The current implementation ignores dimensions purely out of expediency, to get something usable first. That both makes sure it's usable as designed and should help shake out anything that was overlooked. I’d wager a small amount of money that enumerations for dimensions will exist eventually; it's just a matter of making sure our behaviors are all intentional for long-term support.

Contributor

Man you make me look good because I did use the lawyer-advised (kidding here) conditional above:

and we could have enumerations as dimensions

Contributor

@eddelbuettel eddelbuettel left a comment

Thanks for starting this. I may try to give it a whirl building with it next.

@johnkerl johnkerl changed the title [c++] Support Enumeration in C++ Codebase [c++] Support Enumeration in C++ codebase Jul 7, 2023
@nguyenv nguyenv force-pushed the viviannguyen/enumerated-dtypes-in-cpp branch from 44a0f39 to 416d557 Compare July 13, 2023 09:24
@nguyenv nguyenv force-pushed the viviannguyen/enumerated-dtypes-in-cpp branch from 416d557 to 1d2a05f Compare August 16, 2023 19:20
@nguyenv nguyenv force-pushed the viviannguyen/enumerated-dtypes-in-cpp branch from 1d2a05f to 97ab0a6 Compare September 5, 2023 20:17
* Addition of `SOMAArray::get_enum` and `SOMAArray::get_enum_label_on_attr`
* Attach an enumeration/dictionary to the `ColumnBuffer` if applicable;
  this is used when converting from `ArrayBuffers` to Arrow Tables in
  the Python and R APIs
* Prior to TileDB-Inc/TileDB#4272, the SOMA unit tests were erroneously
  writing a byte vector for string dimensions, which maps to `TILEDB_BLOB`
  rather than `TILEDB_STRING_ASCII`
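
The last bullet describes a mapping from the C++ element type to a storage datatype. A rough sketch of that idea follows, under the assumption that the datatype is chosen from the element type of the vector being written. `SketchDatatype`, `SketchTypeOf` and `datatype_of` are hypothetical names, not TileDB's API; only the blob-versus-ASCII distinction comes from the commit message.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for TILEDB_BLOB and TILEDB_STRING_ASCII.
enum class SketchDatatype { Blob, StringAscii };

// Trait mapping an element type to a datatype (illustrative only).
template <typename T>
struct SketchTypeOf;

template <>
struct SketchTypeOf<std::byte> {
    static constexpr SketchDatatype value = SketchDatatype::Blob;
};

template <>
struct SketchTypeOf<char> {
    static constexpr SketchDatatype value = SketchDatatype::StringAscii;
};

// The datatype is decided purely by the vector's element type.
template <typename T>
constexpr SketchDatatype datatype_of(const std::vector<T>&) {
    return SketchTypeOf<T>::value;
}
```

Under this assumption, a test that fills a `std::vector<std::byte>` for a string dimension would be typed as a blob, whereas a `std::vector<char>` (or a `std::string`) would be typed as ASCII.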
@nguyenv nguyenv force-pushed the viviannguyen/enumerated-dtypes-in-cpp branch from 97ab0a6 to 3d01f83 Compare September 6, 2023 15:58
@codecov-commenter

codecov-commenter commented Sep 8, 2023

Codecov Report

Patch has no changes to coverable lines.

❗ Current head 65baaaa differs from pull request most recent head f945eca. Consider uploading reports for the commit f945eca to get more accurate results

@nguyenv nguyenv marked this pull request as ready for review September 14, 2023 15:18
@nguyenv
Member Author

nguyenv commented Sep 14, 2023

Should we also merge the C++ changes from #1674 into this branch?

@johnkerl
Member

@nguyenv sure? @eddelbuettel what do you think?

@eddelbuettel
Contributor

@johnkerl @nguyenv I think we will have some truly awful rebases coming up as a few files, notably in libtiledbsoma, were changed concurrently in different branches. The good news is that things work now, so if we're careful they will still once we're done aligning.

@nguyenv
Member Author

nguyenv commented Sep 14, 2023

Yeah, I am anxious about all the merge conflicts that are about to happen and wondering what the best way to go about this is.

@johnkerl
Member

@nguyenv @eddelbuettel I take that back -- preferring to keep this C++ PR foundational, merge it without 2.17.0-rc0 dependencies, and go from there.

@johnkerl
Member

Yeah, I am anxious about all the merge conflicts that are about to happen and wondering what the best way to go about this is.

One at a time. This PR is foundational. #1511 next.

@johnkerl johnkerl merged commit aa1c3fe into main Sep 14, 2023
@johnkerl johnkerl deleted the viviannguyen/enumerated-dtypes-in-cpp branch September 14, 2023 20:41
@johnkerl johnkerl mentioned this pull request Sep 14, 2023
ihnorton pushed a commit that referenced this pull request Sep 15, 2023
As described in #1558 and #866, adding enumeration support is desirable once we have TileDB Embedded 2.17 available.

**Changes:**

This PR supports reading of columns with enumerations (aka dictionaries aka factor variable) directly via Arrow. Preliminary write support is also available (but still goes through the `tiledb` R package for writes).

**Notes for Reviewer:**

~This PR is now work-in-progress and not ready for a merge while we await TileDB 2.17.~  The branch and PR are ready but should only be merged once the prerequisites have been merged.  It likely needs #1519 (C++ side) and #1663 (CI support).

CI is turned off as the TileDB default build is still without support for enumerations.
johnkerl added a commit that referenced this pull request Sep 15, 2023
* **Issue and/or context:**

As described in #1558 and #866, adding enumeration support is desirable once we have TileDB Embedded 2.17 available.

**Changes:**

This PR supports reading of columns with enumerations (aka dictionaries aka factor variable) directly via Arrow. Preliminary write support is also available (but still goes through the `tiledb` R package for writes).

**Notes for Reviewer:**

~This PR is now work-in-progress and not ready for a merge while we await TileDB 2.17.~  The branch and PR are ready but should only be merged once the prerequisites have been merged.  It likely needs #1519 (C++ side) and #1663 (CI support).

CI is turned off as the TileDB default build is still without support for enumerations.

* **Issue and/or context:**

This PR adds support for returning Arrow tables with dictionaries that can include `ordered` enumerations.
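
For readers unfamiliar with how `ordered` travels through Arrow, here is a small sketch based on the Arrow C data interface: a dictionary-encoded column's schema points at a second schema describing the dictionary values, and an "ordered" bit is set in the column's `flags`. The struct layout and flag values follow the published interface (where the flags are the macros `ARROW_FLAG_DICTIONARY_ORDERED` and `ARROW_FLAG_NULLABLE`); the helper functions and constant names are hypothetical and are not the code changed in `libtiledbsoma`. Release callbacks and ownership are deliberately left out.

```cpp
#include <cstdint>

// Layout as given in the Arrow C data interface specification.
struct ArrowSchema {
    const char* format;
    const char* name;
    const char* metadata;
    int64_t flags;
    int64_t n_children;
    struct ArrowSchema** children;
    struct ArrowSchema* dictionary;
    void (*release)(struct ArrowSchema*);
    void* private_data;
};

// Same values as ARROW_FLAG_DICTIONARY_ORDERED and ARROW_FLAG_NULLABLE.
constexpr int64_t kFlagDictionaryOrdered = 1;
constexpr int64_t kFlagNullable = 2;

// Hypothetical helper: link the values schema and set or clear the ordered bit.
void attach_dictionary(ArrowSchema& column, ArrowSchema& values, bool ordered) {
    column.dictionary = &values;
    if (ordered) {
        column.flags |= kFlagDictionaryOrdered;
    } else {
        column.flags &= ~kFlagDictionaryOrdered;
    }
}

// Hypothetical helper: report whether the column's dictionary is ordered.
bool is_dictionary_ordered(const ArrowSchema& column) {
    return column.dictionary != nullptr &&
           (column.flags & kFlagDictionaryOrdered) != 0;
}
```

In this layout the column schema's `format` describes the index type (for example `"c"` for int8) and the dictionary schema's `format` describes the values (for example `"u"` for UTF-8 strings).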

**Changes:**

Given #1559 which it depends upon, a very small change to just three files in `libtiledbsoma`.

This should become clearer once the dependent PR is merged and can be rebased.

**Notes for Reviewer:**

[SC 34073](https://app.shortcut.com/tiledb-inc/story/34073/c-add-ordered-support-to-arrow-export)

* **Issue and/or context:**

This PR extends the `schema()` function to return an Arrow schema with enumerations including `ordered`.

**Changes:**

Given #1559 which it depends upon, a very small change to just one file.

This should become clearer once the dependent PR is merged and can be rebased.

**Notes for Reviewer:**

[SC 34074](https://app.shortcut.com/tiledb-inc/story/34074/c-add-ordered-support-to-arrow-export)

* [c++] Test fixes for #1559 (#1684)

* ihn/bugfix

* unit-test update

* lint

---------

Co-authored-by: John Kerl