[c++] Fix display of Arrow schema for enum of `bytes` datatype #2305

johnkerl · 2024-03-22T16:37:15Z

Issue and/or context: Details on #2306

See also: #2311

[sc-43673]

Checker from #2306:

import tiledbsoma as soma
sdf = soma.open('test_dataframe')
print(sdf.schema)

Output after this PR:

soma_joinid: int64 not null
int_cat: dictionary<values=int64, indices=int8, ordered=0> not null
int: int64 not null
str_cat: dictionary<values=string, indices=int8, ordered=0> not null
str: large_string not null
byte_cat: dictionary<values=binary, indices=int8, ordered=0> not null
byte: large_binary not null

Changes: Report binary appropriately


johnkerl · 2024-03-22T16:44:13Z

One problem: when I re-run @bkmartinjr's data-generator script (the one in the description of this PR, which creates test_dataframe), its output is now like this:

$ ./2305.py
** Original Pandas schema
soma_joinid       int64
int_cat        category
int               int64
str_cat        category
str              object
byte_cat       category
byte             object
dtype: object
soma_joinid: dtype('int64')
int_cat: CategoricalDtype(categories=[10, 20], ordered=False, categories_dtype=int64)
int: dtype('int64')
str_cat: CategoricalDtype(categories=['A', 'B'], ordered=False, categories_dtype=object)
str: dtype('O')
byte_cat: CategoricalDtype(categories=[b'A', b'B'], ordered=False, categories_dtype=object)
byte: dtype('O')
-----
** Arrow schema, derived from Pandas
soma_joinid: int64
int_cat: dictionary<values=int64, indices=int8, ordered=0>
int: int64
str_cat: dictionary<values=string, indices=int8, ordered=0>
str: string
byte_cat: dictionary<values=binary, indices=int8, ordered=0>
byte: binary
-- schema metadata --
pandas: '{"index_columns": [], "column_indexes": [], "columns": [{"name":' + 964
-----
** Arrow Table derived from pandas
pyarrow.Table
soma_joinid: int64
int_cat: dictionary<values=int64, indices=int8, ordered=0>
int: int64
str_cat: dictionary<values=string, indices=int8, ordered=0>
str: string
byte_cat: dictionary<values=binary, indices=int8, ordered=0>
byte: binary
----
soma_joinid: [[0,1,2,3,4,5]]
int_cat: [  -- dictionary:
[10,20]  -- indices:
[0,1,0,1,1,1]]
int: [[10,20,10,20,20,20]]
str_cat: [  -- dictionary:
["A","B"]  -- indices:
[0,1,0,1,1,1]]
str: [["A","B","A","B","B","B"]]
byte_cat: [  -- dictionary:
[41,42]  -- indices:
[0,1,0,1,1,1]]
byte: [[41,42,41,42,42,42]]
-----
**Created TileDB Array schema
soma_joinid: int64 not null
int_cat: dictionary<values=int64, indices=int8, ordered=0> not null
int: int64 not null
str_cat: dictionary<values=string, indices=int8, ordered=0> not null
str: large_string not null
byte_cat: dictionary<values=binary, indices=int8, ordered=0> not null
byte: large_binary not null
Traceback (most recent call last):
  File "/Users/johnkerl/Desktop/./2305.py", line 64, in <module>
    main()
  File "/Users/johnkerl/Desktop/./2305.py", line 50, in main
    df = soma_dataframe.read().concat().to_pandas()
  File "/Users/johnkerl/git/single-cell-data/TileDB-SOMA/apis/python/src/tiledbsoma/_read_iters.py", line 74, in concat
    return pa.concat_tables(self)
  File "pyarrow/table.pxi", line 5366, in pyarrow.lib.concat_tables
  File "/Users/johnkerl/git/single-cell-data/TileDB-SOMA/apis/python/src/tiledbsoma/_read_iters.py", line 70, in __next__
    return next(self._reader)
  File "/Users/johnkerl/git/single-cell-data/TileDB-SOMA/apis/python/src/tiledbsoma/_read_iters.py", line 464, in _arrow_table_reader
    tbl = sr.read_next()
  File "pyarrow/array.pxi", line 1628, in pyarrow.lib.Array._import_from_c
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Expected 3 buffers for imported type binary, ArrowArray struct has 2
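
The `ArrowInvalid` message above is about buffer counts: in the Arrow columnar layout a variable-length `binary` array carries three buffers (validity bitmap, offsets, data), whereas a fixed-width array carries only two (validity, data). The following is a minimal stdlib-only sketch of that layout for the `byte` column values. The helper name `encode_binary_column` is hypothetical and is not part of pyarrow or tiledbsoma; it only illustrates what the importer expects to find.

```python
import struct

def encode_binary_column(values, large=False):
    """Lay out variable-length byte strings as Arrow-style buffers.

    Returns [validity, offsets, data]. Validity is None here because
    this sketch assumes there are no nulls.
    """
    # binary uses 32-bit offsets; large_binary uses 64-bit offsets.
    offset_code = "q" if large else "i"
    offsets = [0]
    for v in values:
        offsets.append(offsets[-1] + len(v))
    offsets_buf = struct.pack(f"<{len(offsets)}{offset_code}", *offsets)
    data_buf = b"".join(values)
    return [None, offsets_buf, data_buf]

buffers = encode_binary_column([b"A", b"B", b"A", b"B", b"B", b"B"])
print(len(buffers))          # 3 buffers: validity, offsets, data
print(buffers[2].hex(" "))   # 41 42 41 42 42 42
```

The hex bytes `41` and `42` are consistent with how the `byte` and `byte_cat` values are printed in the Arrow table dump above.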

nguyenv · 2024-03-22T16:55:28Z

https://github.com/single-cell-data/TileDB-SOMA/tree/viviannguyen/array-write-path

I actually already fixed this in the write-path refactor, now that enumerations are handled on the C++ side instead of in Python. Overall though, I still have 21 failing unit tests for that refactor, so I don't think I can get my fix out immediately...

(soma-3.11) vivian@mangonada:~/tiledb-bugs$ python 2305.py
** Original Pandas schema
soma_joinid       int64
int_cat        category
int               int64
str_cat        category
str              object
byte_cat       category
byte             object
dtype: object
soma_joinid: dtype('int64')
int_cat: CategoricalDtype(categories=[10, 20], ordered=False, categories_dtype=int64)
int: dtype('int64')
str_cat: CategoricalDtype(categories=['A', 'B'], ordered=False, categories_dtype=object)
str: dtype('O')
byte_cat: CategoricalDtype(categories=[b'A', b'B'], ordered=False, categories_dtype=object)
byte: dtype('O')
-----
** Arrow schema, derived from Pandas
soma_joinid: int64
int_cat: dictionary<values=int64, indices=int8, ordered=0>
int: int64
str_cat: dictionary<values=string, indices=int8, ordered=0>
str: string
byte_cat: dictionary<values=binary, indices=int8, ordered=0>
byte: binary
-- schema metadata --
pandas: '{"index_columns": [], "column_indexes": [], "columns": [{"name":' + 964
-----
** Arrow Table derived from pandas
pyarrow.Table
soma_joinid: int64
int_cat: dictionary<values=int64, indices=int8, ordered=0>
int: int64
str_cat: dictionary<values=string, indices=int8, ordered=0>
str: string
byte_cat: dictionary<values=binary, indices=int8, ordered=0>
byte: binary
----
soma_joinid: [[0,1,2,3,4,5]]
int_cat: [  -- dictionary:
[10,20]  -- indices:
[0,1,0,1,1,1]]
int: [[10,20,10,20,20,20]]
str_cat: [  -- dictionary:
["A","B"]  -- indices:
[0,1,0,1,1,1]]
str: [["A","B","A","B","B","B"]]
byte_cat: [  -- dictionary:
[41,42]  -- indices:
[0,1,0,1,1,1]]
byte: [[41,42,41,42,42,42]]
-----
**Created TileDB Array schema
soma_joinid: int64 not null
int_cat: dictionary<values=int64, indices=int8, ordered=0> not null
int: int64 not null
str_cat: dictionary<values=string, indices=int8, ordered=0> not null
str: large_string not null
byte_cat: dictionary<values=binary, indices=int8, ordered=0> not null
byte: large_binary not null
soma_joinid: dtype('int64'), dtype('int64')
int_cat: CategoricalDtype(categories=[10, 20], ordered=False, categories_dtype=int64), CategoricalDtype(categories=[10, 20], ordered=False, categories_dtype=int64)
Categories dtype: dtype('int64'), dtype('int64')
int: dtype('int64'), dtype('int64')
str_cat: CategoricalDtype(categories=['A', 'B'], ordered=False, categories_dtype=object), CategoricalDtype(categories=['A', 'B'], ordered=False, categories_dtype=object)
Categories dtype: dtype('O'), dtype('O')
str: dtype('O'), dtype('O')
byte_cat: CategoricalDtype(categories=[b'A', b'B'], ordered=False, categories_dtype=object), CategoricalDtype(categories=[b'A', b'B'], ordered=False, categories_dtype=object)
Categories dtype: dtype('O'), dtype('O')
byte: dtype('O'), dtype('O')
   soma_joinid int_cat  int str_cat str byte_cat  byte
0            0      10   10       A   A     b'A'  b'A'
1            1      20   20       B   B     b'B'  b'B'
2            2      10   10       A   A     b'A'  b'A'
3            3      20   20       B   B     b'B'  b'B'
4            4      20   20       B   B     b'B'  b'B'
5            5      20   20       B   B     b'B'  b'B'

codecov · 2024-03-22T17:30:53Z

Codecov Report

Merging #2305 (f81d4c4) into main (b613f3b) will decrease coverage by 0.92%.
The diff coverage is 0.00%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2305      +/-   ##
==========================================
- Coverage   78.74%   77.82%   -0.92%     
==========================================
  Files         140      140              
  Lines       10750    10755       +5     
  Branches      215      216       +1     
==========================================
- Hits         8465     8370      -95     
- Misses       2186     2298     +112     
+ Partials       99       87      -12

Flag	Coverage Δ
libtiledbsoma	`63.67% <0.00%> (-3.89%)`	⬇️
python	`90.58% <ø> (ø)`
r	`74.69% <ø> (ø)`

Flags with carried forward coverage won't be shown.

Components	Coverage Δ
python_api	`90.58% <ø> (ø)`
libtiledbsoma	`48.18% <0.00%> (-0.51%)`	⬇️

johnkerl · 2024-03-22T20:23:08Z

libtiledbsoma/src/utils/arrow_adapter.cc

@@ -377,6 +383,8 @@ std::string_view ArrowAdapter::to_arrow_format(
    tiledb_datatype_t datatype, bool use_large) {
    switch (datatype) {
        case TILEDB_STRING_ASCII:
+            return use_large ? "Z" : "z";  // large because TileDB


@nguyenv noted that she has a fix (see her comment above on this PR, #2305), but that branch is large and has CI failures.

This PR is my attempt to extract the essentials to unblock 1.9 for issue #2306 ([c++] Fix display of Arrow schema for enum of bytes).

When I have "Z" here, the repro script in the description field of this PR passes. But then other existing unit-test cases fail, since what should read back as ["a","b"] reads back as [b"a", b"a"].

Yet when I have "U" here, then the repro script in the description field of this PR fails -- the reported problem is simply not fixed -- but all existing unit-test cases pass.

I seem to recall that STRING_ASCII along with some other flag, metadata something-something, is used to differentiate between string-as-bytes and string-as-string ... @nguyenv / @ihnorton perhaps you recall how TileDB-Py does this (as it successfully does, in the code snippet included in the description field of this PR).
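
The "Z" versus "U" trade-off above is about Arrow C data interface format strings: `z`/`Z` mean binary/large binary, and `u`/`U` mean UTF-8 string/large string. Below is a rough sketch of why the TileDB datatype alone cannot pick between them when one type backs both strings and bytes. The `treat_as_bytes` flag and this Python function are hypothetical stand-ins for illustration; they are not the real `ArrowAdapter::to_arrow_format` signature and not necessarily what the merged fix does.

```python
# Arrow C data interface format strings for variable-length types.
ARROW_FORMATS = {
    ("binary", False): "z",
    ("binary", True): "Z",
    ("string", False): "u",
    ("string", True): "U",
}

def to_arrow_format(datatype, use_large=True, treat_as_bytes=False):
    """Hypothetical Python analogue of the C++ switch under review."""
    if datatype == "TILEDB_STRING_ASCII":
        # Per the comment above, this one TileDB type can hold either
        # string-as-string or string-as-bytes data, so an extra hint
        # (assumed here as treat_as_bytes) has to choose the format.
        kind = "binary" if treat_as_bytes else "string"
        return ARROW_FORMATS[(kind, use_large)]
    raise ValueError(f"unhandled datatype in this sketch: {datatype}")

print(to_arrow_format("TILEDB_STRING_ASCII"))                       # U
print(to_arrow_format("TILEDB_STRING_ASCII", treat_as_bytes=True))  # Z
```

Hard-coding either branch reproduces one of the two failure modes described above: always returning the binary format breaks string readback, and always returning the string format leaves the bytes enum misreported.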

nguyenv · 2024-03-22

main...viviannguyen/array-write-path#diff-c66edd736f9a9cb059c6f6e0f4cc6d1cdfa0fd91a2b9220dd2e83c12c2546089R451-R457
main...viviannguyen/array-write-path#diff-c66edd736f9a9cb059c6f6e0f4cc6d1cdfa0fd91a2b9220dd2e83c12c2546089R407

I believe these are the snippets that should fix the issue you're seeing.

nguyenv · 2024-03-22

I handle the string types for enums specially like that, rather than relying on to_tiledb_format.

johnkerl · 2024-03-22

@nguyenv both of the links you sent me open up with "57 files changed" -- non-navigable

Nonetheless I'll take a look at what you wrote in Slack 🤞

johnkerl · 2024-03-22T23:14:24Z

Thanks @nguyenv for the advice! CI is green, and my interactive use as noted in the description field of this PR also works great. :D

cc @ihnorton @jp-dark or anyone else still online at this late hour for approval; or we can do this Monday.

ihnorton · 2024-03-23

Consider adding the script from #2306 to a python repr test? Otherwise LGTM.

johnkerl · 2024-03-24T17:46:41Z

Thanks @ihnorton

I was delaying the merge for a unit-test case

It'll be more 'somatic' to test it like this:

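    import pyarrow as pa
    import tiledbsoma as soma

    uri = "test_dataframe"  # assumed: the array created by the data-generator script

    with soma.open(uri) as sdf:
        # Every categorical column should report int8 indices plus the
        # Arrow type of its category values.
        f = sdf.schema.field("int_cat")
        assert f.type.index_type == pa.int8()
        assert f.type.value_type == pa.int64()

        f = sdf.schema.field("str_cat")
        assert f.type.index_type == pa.int8()
        assert f.type.value_type == pa.string()

        # The bytes enum must report binary, not string.
        f = sdf.schema.field("byte_cat")
        assert f.type.index_type == pa.int8()
        assert f.type.value_type == pa.binary()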

with data as ☝️ and I'll get that in before merging

johnkerl · 2024-03-24T17:51:57Z

@nguyenv this is still not right :(

When I said this above:

Thanks @nguyenv for the advice! CI is green, and my interactive use as noted in the description field of this PR also works great. :D

I was misreading my own output :(

In the original repro script I was using, I had confused myself by putting the attrs in this order:

soma_joinid: int64 not null
int: int64 not null
int_cat: dictionary<values=int64, indices=int8, ordered=0> not null
str: large_string not null
str_cat: dictionary<values=string, indices=int8, ordered=0> not null
byte: large_binary not null
byte_cat: dictionary<values=string, indices=int8, ordered=0> not null

A less error-prone ordering is here: https://gist.github.com/johnkerl/6c4a8ee3ceacaaf2e1929873095b2933

And we still have:

import tiledbsoma as soma
sdf = soma.open('test_dataframe')
print(sdf.schema)

outputting

soma_joinid: int64 not null
int: int64 not null
str: large_string not null
byte: large_binary not null
int_cat: dictionary<values=int64, indices=int8, ordered=0> not null
str_cat: dictionary<values=string, indices=int8, ordered=0> not null
byte_cat: dictionary<values=string, indices=int8, ordered=0> not null <---- INCORRECT

even though TileDB-Py reports correctly, with

#!/usr/bin/env python

import tiledb

A = tiledb.open('test_dataframe')

for i in range(A.schema.nattr):
    attr = A.schema.attr(i)
    try:
        # The attribute dtype is the enum's index type; the enumeration
        # itself carries the value type.
        index_type = attr.dtype
        value_type = A.enum(attr.name).dtype
        print(f"enum name={attr.name} index_type={index_type.name} value_type={value_type.name}")
    except tiledb.cc.TileDBError:
        pass  # not an enum attr

outputting

enum name=int_cat index_type=int8 value_type=int64
enum name=str_cat index_type=int8 value_type=str32
enum name=byte_cat index_type=int8 value_type=bytes <---- CORRECT
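
Since the earlier misreading came from eyeballing the printed schema, a small checker can make the comparison mechanical. The sketch below parses the text that `print(sdf.schema)` produced above and extracts the value type of each dictionary column; the function name `dictionary_value_types` is hypothetical, and the expected mapping is taken from the TileDB-Py output above (with `str32` and `bytes` written as the Arrow names `string` and `binary`).

```python
import re

# Matches lines such as:
#   byte_cat: dictionary<values=binary, indices=int8, ordered=0> not null
DICT_LINE = re.compile(
    r"^(?P<name>\w+): dictionary<values=(?P<values>\w+), indices=(?P<indices>\w+),"
)

def dictionary_value_types(schema_text):
    """Map each dictionary-typed column name to its Arrow value type."""
    found = {}
    for line in schema_text.splitlines():
        m = DICT_LINE.match(line.strip())
        if m:
            found[m.group("name")] = m.group("values")
    return found

reported = """\
soma_joinid: int64 not null
int: int64 not null
str: large_string not null
byte: large_binary not null
int_cat: dictionary<values=int64, indices=int8, ordered=0> not null
str_cat: dictionary<values=string, indices=int8, ordered=0> not null
byte_cat: dictionary<values=string, indices=int8, ordered=0> not null
"""

expected = {"int_cat": "int64", "str_cat": "string", "byte_cat": "binary"}
actual = dictionary_value_types(reported)
# Keep only the columns whose reported value type differs from expected.
mismatches = {k: (actual.get(k), v) for k, v in expected.items() if actual.get(k) != v}
print(mismatches)  # {'byte_cat': ('string', 'binary')}
```

A check like this does not depend on the order in which the attributes are listed, which was the source of the confusion described above.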

nguyenv · 2024-03-24T17:56:22Z

OK I will take a look tomorrow.

johnkerl · 2024-03-24T17:57:01Z

At present with R and this PR we have:

> sdf <- SOMADataFrameOpen('test_dataframe')
> sdf$schema()
Error: Unsupported data type: CHAR

so I'll add an R unit-test case here as well

johnkerl · 2024-03-25T22:37:26Z

Re my

so I'll add an R unit-test case here as well

we need first to fix R CI failures

johnkerl · 2024-03-26T17:31:51Z

All current CI is green. There are still a few things wrong I'm seeing interactively for which we had inadequate unit-test coverage. More commits coming soon.

eddelbuettel · 2024-03-26T17:43:24Z

Keep in mind that these Arrow pieces will likely benefit from / change under / get simpler with the nanoarrow refactoring. However, I am currently blocked on that one, as I see a lot of Python CI shrapnel (though IIRC only from three test files; the majority passes) that I cannot make real progress on ... as I am a complete neophyte when it comes to Python / C(++) interop.

johnkerl · 2024-03-26T18:12:41Z

Good commits on this PR; all green; further work tracked on #2324.

@nguyenv

* [c++] Fix display of Arrow schema for enum of bytes
* bugfix per @nguyenv
* more of same
* try 3 with help of @nguyenv
* python unit-test case
* extend the unit-test case a bit
* implement @nguyenv advice
* Partial reversal of schema format assignment
* test

Co-authored-by: Dirk Eddelbuettel <[email protected]>

@nguyenv

#2325)

* [c++] Fix display of Arrow schema for enum of bytes
* bugfix per @nguyenv
* more of same
* try 3 with help of @nguyenv
* python unit-test case
* extend the unit-test case a bit
* implement @nguyenv advice
* Partial reversal of schema format assignment
* test

Co-authored-by: John Kerl <[email protected]>
Co-authored-by: Dirk Eddelbuettel <[email protected]>

shortcut-integration · 2024-03-26T18:39:30Z

This pull request has been linked to Shortcut Story #43673: tiledbsoma 1.9.0.

Co-authored-by: John Kerl <[email protected]>

[c++] Fix display of Arrow schema for enum of bytes

70ac9ac

johnkerl requested review from jp-dark and nguyenv March 22, 2024 16:37

johnkerl marked this pull request as ready for review March 22, 2024 16:38

johnkerl marked this pull request as draft March 22, 2024 16:42

johnkerl added the blocks-1.9 label Mar 22, 2024

johnkerl mentioned this pull request Mar 22, 2024

[c++] Fix display of Arrow schema for enum of bytes #2306

Closed

johnkerl removed the blocks-1.9 label Mar 22, 2024

johnkerl added 2 commits March 22, 2024 13:43

bugfix per @nguyenv

ad54390

more of same

0b06695

johnkerl force-pushed the kerl/arrow-schema-bytes branch from 45cbcb4 to 0b06695 Compare March 22, 2024 18:21

johnkerl requested a review from ihnorton March 22, 2024 20:18

johnkerl marked this pull request as ready for review March 22, 2024 20:19

try 3 with help of @nguyenv

8422fa7

johnkerl added the backport release-1.8 label Mar 22, 2024

johnkerl changed the title ~~[c++] Fix display of Arrow schema for enum of bytes~~ [c++] Fix display of Arrow schema for enum of bytes datatype Mar 22, 2024

ihnorton approved these changes Mar 23, 2024

johnkerl requested a review from ihnorton March 24, 2024 17:52

johnkerl added 3 commits March 24, 2024 17:38

python unit-test case

629e7bb

extend the unit-test case a bit

d37b997

implement @nguyenv advice

2e9f01a

Merge branch 'main' into kerl/arrow-schema-bytes

472e388

johnkerl mentioned this pull request Mar 26, 2024

tiledbsoma 1.9.0 pre check TileDB-Inc/tiledbsoma-feedstock#110

Closed

Partial reversal of schema format assignment

793ffa3

johnkerl mentioned this pull request Mar 26, 2024

[python/ci] Unbreak Python 3.8 CI #2323

Closed

test

f81d4c4

johnkerl mentioned this pull request Mar 26, 2024

[ci] Set up a Python/R interop test case for readback of bytes types #2324

Open

johnkerl merged commit dfa1de9 into main Mar 26, 2024
15 checks passed

johnkerl deleted the kerl/arrow-schema-bytes branch March 26, 2024 18:13

github-actions bot mentioned this pull request Mar 26, 2024

[Backport release-1.8] [c++] Fix display of Arrow schema for enum of bytes datatype #2325

Merged

johnkerl added a commit that referenced this pull request Mar 26, 2024

[python/ci] Typofix from #2305

4b791ab

johnkerl added a commit that referenced this pull request Mar 26, 2024

[python/ci] Typofix from #2305 (#2326)

9d7d407

github-actions bot pushed a commit that referenced this pull request Mar 26, 2024

[python/ci] Typofix from #2305 (#2326)

ea42ec5

johnkerl added a commit that referenced this pull request Mar 26, 2024

[python/ci] Typofix from #2305 (#2326) (#2330)

b28ea6e

Co-authored-by: John Kerl <[email protected]>

github-actions bot pushed a commit that referenced this pull request Mar 26, 2024

[python/ci] Typofix from #2305 (#2326)

aba4539

johnkerl added a commit that referenced this pull request Mar 26, 2024

[python/ci] Typofix from #2305 (#2326) (#2331)

51fd41e

Co-authored-by: John Kerl <[email protected]>
