-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[r] Add enumeration support #1559
Conversation
This pull request has been linked to Shortcut Story #32235: [r] Support enumerations in SOMA. |
a3960e8
to
c1b070c
Compare
A quick snapshot of what works as of today / this branch when 2.17 is available. Codelibrary(tiledbsoma)
library(tiledb)
pp <- palmerpenguins::penguins
uri <- tempfile(pattern="tiledb")
fromDataFrame(pp, uri)
## read one: soma_array_reader
p <- soma_array_reader(uri)
print(tibble::as_tibble(arrow::RecordBatch$import_from_c(p[[1]], p[[2]])))
## read twp: iterated reader for sdf
p <- SOMADataFrameOpen(uri)$read()$concat()
print(tibble::as_tibble(p)) Output# A tibble: 344 × 9
`__tiledb_rows` species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
<int> <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
1 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
2 2 Adelie Torgersen 39.5 17.4 186 3800 female 2007
3 3 Adelie Torgersen 40.3 18 195 3250 female 2007
4 4 Adelie Torgersen NA NA NA NA NA 2007
5 5 Adelie Torgersen 36.7 19.3 193 3450 female 2007
6 6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
7 7 Adelie Torgersen 38.9 17.8 181 3625 female 2007
8 8 Adelie Torgersen 39.2 19.6 195 4675 male 2007
9 9 Adelie Torgersen 34.1 18.1 193 3475 NA 2007
10 10 Adelie Torgersen 42 20.2 190 4250 NA 2007
# ℹ 334 more rows
# A tibble: 344 × 9
`__tiledb_rows` species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
<int> <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
1 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
2 2 Adelie Torgersen 39.5 17.4 186 3800 female 2007
3 3 Adelie Torgersen 40.3 18 195 3250 female 2007
4 4 Adelie Torgersen NA NA NA NA NA 2007
5 5 Adelie Torgersen 36.7 19.3 193 3450 female 2007
6 6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
7 7 Adelie Torgersen 38.9 17.8 181 3625 female 2007
8 8 Adelie Torgersen 39.2 19.6 195 4675 male 2007
9 9 Adelie Torgersen 34.1 18.1 193 3475 NA 2007
10 10 Adelie Torgersen 42 20.2 190 4250 NA 2007
# ℹ 334 more rows |
c45ee26
to
4eb3f90
Compare
9a4c341
to
407f83e
Compare
8013da5
to
4668e2e
Compare
@eddelbuettel can you please resolve conflicts? |
4668e2e
to
455182e
Compare
Rebased again (which was a little more involved than usual), passes 1930 tests here. (I used to have all CI turned off waiting for the merge of TileDB Core, that regressed so some read X now.) |
fd647bf
to
ed40f3b
Compare
ed40f3b
to
7f9ee36
Compare
Codecov ReportPatch has no changes to coverable lines.
❗ Your organization needs to install the Codecov GitHub app to enable full functionality. 📢 Thoughts on this report? Let us know!. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚢
Pending CI of course! 😎
I should probaby update the libtiledbsoma side from the rc to the release too. Or will you take of that 'soon' (pending the unit test update) ? (And I do expect this to go ❌ on python interop as the last ones did....) And so it happens. So I thing this needs some work somewhere else first, then a rebase and then we can talk |
@eddelbuettel I tried #1677 as a candidate to go underneath #1519 #1511 #1559. There's a CI issue I'm puzzling over. Let's chat in Slack. |
@johnkerl I am lost. Might be a candidate for outreach to the author of these contributed tests too. |
The failing CI step is R/Python interop which is here |
* [c++] Support `Enumeration` in C++ Codebase * Addition of `SOMAArray::get_enum` and `SOMAArray::get_enum_label_on_attr` * Attach an enumeration/dictionary to the `ColumnBuffer` if applicable; this is used when converting from `ArrayBuffers` to Arrow Tables in the Python and R APIs * Add `get_attr_to_enum_mapping` Function * WIP fix bug where attr name was passed instead of enum name * Add Unit Tests for Enumeration in C++ * `to_varlen_buffers` Returns `std::string` * Prior to TileDB-Inc/TileDB#4272, the SOMA unit tests were erroneously writing a byte vector for string dimensions which maps to `TILEDB_BLOB` rather than `TILEDB_STRING_ASCII` * Update SOMA Array get_metadata Signatures * Depend on 2.17.0-rc0 * resolve an incomplete merge with #1559 * [r] Update tiledb-r to RC (borrowed from #1663, #1665) * [r] Undo brown-bag typo in helper script * 2.17.0 --------- Co-authored-by: John Kerl <[email protected]> Co-authored-by: Dirk Eddelbuettel <[email protected]>
6ca5418
to
de82254
Compare
Ok, did a first rebase (which was much smother than a local one I had recently) so fingers crossed! |
As described in #1558 and #866, adding enumeration support is desirable once we have TileDB Embedded 2.17 available **Changes:** This PR supports reading of columns with enumerations (aka dictionaries aka factor variable) directly via Arrow. Preliminary write support is also available (but still goes through the `tiledb` R package for writes). **Notes for Reviewer:** ~This PR is now work-in-progress and not ready for a merge while we await TileDB 2.17.~ The branch and PR are ready but should only be merged once prequisites are been merged. It likely needs #1519 (C++ side) and #1663 (CI support). CI is turned off as the TileDB default build is still without support for enumerations.
This PR adds support for return Arrow tables with dictionaries that can include `ordered` enumerations. **Changes:** Given #1559 which it depends upon, a very small change to just three files in `libtiledbsoma`. This should become clearer once the dependent PR is merged and can be rebased. **Notes for Reviewer:** [SC 34073](https://app.shortcut.com/tiledb-inc/story/34073/c-add-ordered-support-to-arrow-export)
This PR extends the `schema()` function to return an Arrow schema with enumerations including `ordered`. **Changes:** Given #1559 which it depends upon, a very small change to just one file. This should become clearer once the dependent PR is merged and can be rebased. **Notes for Reviewer:** [SC 34074](https://app.shortcut.com/tiledb-inc/story/34074/c-add-ordered-support-to-arrow-export)
Ok if we all hold our breath for 20 minutes I am sure this run will turn all green.... |
@eddelbuettel That's the plan! :) |
Cnttlkrtksmhldngmbrthshrd |
(That was "Can't type or talk as I am holding my breath" approximately free of vowels ... 😉 ) |
0d7cb02
to
10a94cd
Compare
Yay yay yay Not so fast. One more. |
* [r] Update `write_soma.data.frame()` to support categoricals Allow `write_soma.data.frame()` to write factors as categoricals. `write_soma()` will no longer cast factors to characters prior to writing Redo of #1683 since it got removed from #1559 * Handle enumerations in `write_soma.data.frame()` * Add tests
Issue and/or context:
As described in #1558 and #866, adding enumeration support is desirable once we have TileDB Embedded 2.17 available
Changes:
This PR supports reading of columns with enumerations (aka dictionaries aka factor variable) directly via Arrow. Preliminary write support is also available (but still goes through the
tiledb
R package for writes).Notes for Reviewer:
This PR is now work-in-progress and not ready for a merge while we await TileDB 2.17.The branch and PR are ready but should only be merged once prequisites are been merged. It likely needs #1519 (C++ side) and #1663 (CI support).CI is turned off as the TileDB default build is still without support for enumerations.