[tracking] Add support for enumerated types aka categoricals aka factors #866

Closed
johnkerl opened this issue Feb 3, 2023 · 1 comment

johnkerl commented Feb 3, 2023

Many systems such as AnnData, Pandas, Arrow, and the R language itself support categoricals.

  • Simple example: red, yellow, green
  • Users get to specify sort order -- if these were strings they'd be green, red, yellow
  • Storage backends can map these to encoded integers with a lookup mapping if desired
  • Similarly, these lend themselves well to dictionary encoding in backend storage
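
As a rough illustration of the points above, here is a minimal standard-library sketch of a categorical column stored as integer codes plus a lookup list. All names (`categories`, `values`, `code_of`) are hypothetical and are not TileDB-SOMA or TileDB-Core API; real backends would use their own dictionary encoding.

```python
# Hypothetical illustration only; not TileDB-SOMA API.
# The category list carries the user-specified order.
categories = ["red", "yellow", "green"]
values = ["green", "red", "yellow", "red"]

# Encode: map each label to its position in the category list.
code_of = {cat: i for i, cat in enumerate(categories)}
codes = [code_of[v] for v in values]

# Decode: look the labels back up from the codes.
decoded = [categories[c] for c in codes]

# Sorting plain strings gives alphabetical order ...
sorted_as_strings = sorted(values)
# ... while sorting by code respects the user-specified order.
sorted_as_categories = [categories[c] for c in sorted(codes)]

print(codes)
print(sorted_as_strings)
print(sorted_as_categories)
```

Only the small integers need to be stored per cell; the label list is stored once, which is why this layout suits dictionary encoding.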

The status quo in TileDB-SOMA has been that these are "decategoricalized" or "flattened" to strings (or ints, etc.).
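
To make the cost of flattening concrete, here is a small sketch (plain Python, hypothetical names, not the actual TileDB-SOMA ingestion code) of what is lost when a categorical is written out as plain strings: the values survive, but the user-specified category order does not.

```python
# Hypothetical illustration only; not TileDB-SOMA code.
categories = ["red", "yellow", "green"]  # user-specified order
codes = [2, 0, 1, 0]

# "Flattening": replace each code with its label and keep only the strings.
flattened = [categories[c] for c in codes]

# A reader that sees only the strings can rebuild the set of labels,
# but has no ordering to fall back on other than alphabetical.
recovered = sorted(set(flattened))

print(flattened)
print(recovered)
```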

Evaluation plan:

@johnkerl johnkerl added the enhancement New feature or request label Feb 3, 2023
@johnkerl johnkerl changed the title Add support for categoricals Add support for categoricals (TileDB-Core tracker) Feb 3, 2023
@johnkerl johnkerl changed the title Add support for categoricals (TileDB-Core tracker) Add support for categoricals (TileDB-Core feature tracker) Feb 3, 2023
@johnkerl johnkerl changed the title Add support for categoricals (TileDB-Core feature tracker) Add support for enumerated types AKA categoricals AKA factors Jul 5, 2023
@johnkerl johnkerl changed the title Add support for enumerated types AKA categoricals AKA factors [tracking] Add support for enumerated types AKA categoricals AKA factors Jul 5, 2023
@johnkerl johnkerl removed their assignment Jul 19, 2023
ihnorton pushed a commit that referenced this issue Sep 15, 2023
As described in #1558 and #866, adding enumeration support is desirable once we have TileDB Embedded 2.17 available

**Changes:**

This PR supports reading of columns with enumerations (aka dictionaries aka factor variables) directly via Arrow. Preliminary write support is also available (but still goes through the `tiledb` R package for writes).

**Notes for Reviewer:**

~This PR is now work-in-progress and not ready for a merge while we await TileDB 2.17.~  The branch and PR are ready but should only be merged once the prerequisites have been merged.  It likely needs #1519 (C++ side) and #1663 (CI support).

CI is turned off as the TileDB default build is still without support for enumerations.
johnkerl added a commit that referenced this issue Sep 15, 2023
* **Issue and/or context:**

As described in #1558 and #866, adding enumeration support is desirable once we have TileDB Embedded 2.17 available

**Changes:**

This PR supports reading of columns with enumerations (aka dictionaries aka factor variables) directly via Arrow. Preliminary write support is also available (but still goes through the `tiledb` R package for writes).

**Notes for Reviewer:**

~This PR is now work-in-progress and not ready for a merge while we await TileDB 2.17.~  The branch and PR are ready but should only be merged once the prerequisites have been merged.  It likely needs #1519 (C++ side) and #1663 (CI support).

CI is turned off as the TileDB default build is still without support for enumerations.

* **Issue and/or context:**

This PR adds support for returning Arrow tables with dictionaries that can include `ordered` enumerations.

**Changes:**

Given #1559, which it depends upon, this is a very small change to just three files in `libtiledbsoma`.

This should become clearer once the dependent PR is merged and can be rebased.

**Notes for Reviewer:**

[SC 34073](https://app.shortcut.com/tiledb-inc/story/34073/c-add-ordered-support-to-arrow-export)

* **Issue and/or context:**

This PR extends the `schema()` function to return an Arrow schema with enumerations including `ordered`.

**Changes:**

Given #1559, which it depends upon, this is a very small change to just one file.

This should become clearer once the dependent PR is merged and can be rebased.

**Notes for Reviewer:**

[SC 34074](https://app.shortcut.com/tiledb-inc/story/34074/c-add-ordered-support-to-arrow-export)

* [c++] Test fixes for #1559 (#1684)

* ihn/bugfix

* unit-test update

* lint

---------

Co-authored-by: John Kerl <[email protected]>
@johnkerl
Member Author

Only remaining issue is #1710 which has its own tracking; closing this parent/tracker issue.
