[c++] Fix display of Arrow schema for enum of bytes datatype #2305

Merged
johnkerl merged 10 commits into main from kerl/arrow-schema-bytes
Mar 26, 2024

@johnkerl johnkerl commented Mar 22, 2024

Issue and/or context: Details on #2306

See also: #2311

[sc-43673]

Checker from #2306:

import tiledbsoma as soma
sdf = soma.open('test_dataframe')
print(sdf.schema)

Output after this PR:

soma_joinid: int64 not null
int_cat: dictionary<values=int64, indices=int8, ordered=0> not null
int: int64 not null
str_cat: dictionary<values=string, indices=int8, ordered=0> not null
str: large_string not null
byte_cat: dictionary<values=binary, indices=int8, ordered=0> not null
byte: large_binary not null

Changes: Report binary appropriately

Notes for Reviewer:

@johnkerl johnkerl requested review from jp-dark and nguyenv March 22, 2024 16:37
@johnkerl johnkerl marked this pull request as ready for review March 22, 2024 16:38
@johnkerl johnkerl marked this pull request as draft March 22, 2024 16:42
@johnkerl

One problem: when I re-run @bkmartinjr's data-generator script (the one in the description of this PR, which creates test_dataframe), its output now looks like this:

$ ./2305.py
** Original Pandas schema
soma_joinid       int64
int_cat        category
int               int64
str_cat        category
str              object
byte_cat       category
byte             object
dtype: object
soma_joinid: dtype('int64')
int_cat: CategoricalDtype(categories=[10, 20], ordered=False, categories_dtype=int64)
int: dtype('int64')
str_cat: CategoricalDtype(categories=['A', 'B'], ordered=False, categories_dtype=object)
str: dtype('O')
byte_cat: CategoricalDtype(categories=[b'A', b'B'], ordered=False, categories_dtype=object)
byte: dtype('O')
-----
** Arrow schema, derived from Pandas
soma_joinid: int64
int_cat: dictionary<values=int64, indices=int8, ordered=0>
int: int64
str_cat: dictionary<values=string, indices=int8, ordered=0>
str: string
byte_cat: dictionary<values=binary, indices=int8, ordered=0>
byte: binary
-- schema metadata --
pandas: '{"index_columns": [], "column_indexes": [], "columns": [{"name":' + 964
-----
** Arrow Table derived from pandas
pyarrow.Table
soma_joinid: int64
int_cat: dictionary<values=int64, indices=int8, ordered=0>
int: int64
str_cat: dictionary<values=string, indices=int8, ordered=0>
str: string
byte_cat: dictionary<values=binary, indices=int8, ordered=0>
byte: binary
----
soma_joinid: [[0,1,2,3,4,5]]
int_cat: [  -- dictionary:
[10,20]  -- indices:
[0,1,0,1,1,1]]
int: [[10,20,10,20,20,20]]
str_cat: [  -- dictionary:
["A","B"]  -- indices:
[0,1,0,1,1,1]]
str: [["A","B","A","B","B","B"]]
byte_cat: [  -- dictionary:
[41,42]  -- indices:
[0,1,0,1,1,1]]
byte: [[41,42,41,42,42,42]]
-----
**Created TileDB Array schema
soma_joinid: int64 not null
int_cat: dictionary<values=int64, indices=int8, ordered=0> not null
int: int64 not null
str_cat: dictionary<values=string, indices=int8, ordered=0> not null
str: large_string not null
byte_cat: dictionary<values=binary, indices=int8, ordered=0> not null
byte: large_binary not null
Traceback (most recent call last):
  File "/Users/johnkerl/Desktop/./2305.py", line 64, in <module>
    main()
  File "/Users/johnkerl/Desktop/./2305.py", line 50, in main
    df = soma_dataframe.read().concat().to_pandas()
  File "/Users/johnkerl/git/single-cell-data/TileDB-SOMA/apis/python/src/tiledbsoma/_read_iters.py", line 74, in concat
    return pa.concat_tables(self)
  File "pyarrow/table.pxi", line 5366, in pyarrow.lib.concat_tables
  File "/Users/johnkerl/git/single-cell-data/TileDB-SOMA/apis/python/src/tiledbsoma/_read_iters.py", line 70, in __next__
    return next(self._reader)
  File "/Users/johnkerl/git/single-cell-data/TileDB-SOMA/apis/python/src/tiledbsoma/_read_iters.py", line 464, in _arrow_table_reader
    tbl = sr.read_next()
  File "pyarrow/array.pxi", line 1628, in pyarrow.lib.Array._import_from_c
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Expected 3 buffers for imported type binary, ArrowArray struct has 2
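
For background on the error above: the Arrow columnar format fixes how many buffers each physical layout carries, and pyarrow checks that count when importing an `ArrowArray`. The sketch below only tabulates the counts for the types that appear in this thread, keyed by Arrow C Data Interface format string. The names `EXPECTED_BUFFERS` and `expected_buffers` are hypothetical and are not part of pyarrow or TileDB-SOMA.

```python
# Hypothetical lookup table (not a pyarrow or TileDB-SOMA API).
# Keys are Arrow C Data Interface format strings; values are the number
# of buffers the Arrow columnar format defines for that layout.
EXPECTED_BUFFERS = {
    "c": 2,  # int8: validity bitmap + data
    "l": 2,  # int64: validity bitmap + data
    "u": 3,  # utf8 string: validity bitmap + 32-bit offsets + data
    "U": 3,  # large utf8 string: validity bitmap + 64-bit offsets + data
    "z": 3,  # binary: validity bitmap + 32-bit offsets + data
    "Z": 3,  # large binary: validity bitmap + 64-bit offsets + data
}


def expected_buffers(fmt: str) -> int:
    """Return the buffer count for a format string covered by the table."""
    try:
        return EXPECTED_BUFFERS[fmt]
    except KeyError:
        raise ValueError(f"format {fmt!r} is not covered by this sketch") from None


if __name__ == "__main__":
    for fmt in ("c", "l", "z", "Z"):
        print(fmt, expected_buffers(fmt))
```

In the C Data Interface a dictionary-encoded array carries the format string of its index type, with the dictionary attached separately, so an `int8`-indexed column has a two-buffer layout. That may be relevant to the mismatch reported here, though the thread does not state the cause.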

@nguyenv

nguyenv commented Mar 22, 2024

https://github.com/single-cell-data/TileDB-SOMA/tree/viviannguyen/array-write-path

I actually already fixed this in the write-path refactor, now that enumerations are handled on the C++ side instead of in Python. Overall though, I still have 21 failing unit tests for this PR, so I don't think I can get my fix out immediately...

(soma-3.11) vivian@mangonada:~/tiledb-bugs$ python 2305.py
** Original Pandas schema
soma_joinid       int64
int_cat        category
int               int64
str_cat        category
str              object
byte_cat       category
byte             object
dtype: object
soma_joinid: dtype('int64')
int_cat: CategoricalDtype(categories=[10, 20], ordered=False, categories_dtype=int64)
int: dtype('int64')
str_cat: CategoricalDtype(categories=['A', 'B'], ordered=False, categories_dtype=object)
str: dtype('O')
byte_cat: CategoricalDtype(categories=[b'A', b'B'], ordered=False, categories_dtype=object)
byte: dtype('O')
-----
** Arrow schema, derived from Pandas
soma_joinid: int64
int_cat: dictionary<values=int64, indices=int8, ordered=0>
int: int64
str_cat: dictionary<values=string, indices=int8, ordered=0>
str: string
byte_cat: dictionary<values=binary, indices=int8, ordered=0>
byte: binary
-- schema metadata --
pandas: '{"index_columns": [], "column_indexes": [], "columns": [{"name":' + 964
-----
** Arrow Table derived from pandas
pyarrow.Table
soma_joinid: int64
int_cat: dictionary<values=int64, indices=int8, ordered=0>
int: int64
str_cat: dictionary<values=string, indices=int8, ordered=0>
str: string
byte_cat: dictionary<values=binary, indices=int8, ordered=0>
byte: binary
----
soma_joinid: [[0,1,2,3,4,5]]
int_cat: [  -- dictionary:
[10,20]  -- indices:
[0,1,0,1,1,1]]
int: [[10,20,10,20,20,20]]
str_cat: [  -- dictionary:
["A","B"]  -- indices:
[0,1,0,1,1,1]]
str: [["A","B","A","B","B","B"]]
byte_cat: [  -- dictionary:
[41,42]  -- indices:
[0,1,0,1,1,1]]
byte: [[41,42,41,42,42,42]]
-----
**Created TileDB Array schema
soma_joinid: int64 not null
int_cat: dictionary<values=int64, indices=int8, ordered=0> not null
int: int64 not null
str_cat: dictionary<values=string, indices=int8, ordered=0> not null
str: large_string not null
byte_cat: dictionary<values=binary, indices=int8, ordered=0> not null
byte: large_binary not null
soma_joinid: dtype('int64'), dtype('int64')
int_cat: CategoricalDtype(categories=[10, 20], ordered=False, categories_dtype=int64), CategoricalDtype(categories=[10, 20], ordered=False, categories_dtype=int64)
Categories dtype: dtype('int64'), dtype('int64')
int: dtype('int64'), dtype('int64')
str_cat: CategoricalDtype(categories=['A', 'B'], ordered=False, categories_dtype=object), CategoricalDtype(categories=['A', 'B'], ordered=False, categories_dtype=object)
Categories dtype: dtype('O'), dtype('O')
str: dtype('O'), dtype('O')
byte_cat: CategoricalDtype(categories=[b'A', b'B'], ordered=False, categories_dtype=object), CategoricalDtype(categories=[b'A', b'B'], ordered=False, categories_dtype=object)
Categories dtype: dtype('O'), dtype('O')
byte: dtype('O'), dtype('O')
   soma_joinid int_cat  int str_cat str byte_cat  byte
0            0      10   10       A   A     b'A'  b'A'
1            1      20   20       B   B     b'B'  b'B'
2            2      10   10       A   A     b'A'  b'A'
3            3      20   20       B   B     b'B'  b'B'
4            4      20   20       B   B     b'B'  b'B'
5            5      20   20       B   B     b'B'  b'B'

codecov bot commented Mar 22, 2024

Codecov Report

Merging #2305 (f81d4c4) into main (b613f3b) will decrease coverage by 0.92%.
The diff coverage is 0.00%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2305      +/-   ##
==========================================
- Coverage   78.74%   77.82%   -0.92%     
==========================================
  Files         140      140              
  Lines       10750    10755       +5     
  Branches      215      216       +1     
==========================================
- Hits         8465     8370      -95     
- Misses       2186     2298     +112     
+ Partials       99       87      -12     
Flag Coverage Δ
libtiledbsoma 63.67% <0.00%> (-3.89%) ⬇️
python 90.58% <ø> (ø)
r 74.69% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
python_api 90.58% <ø> (ø)
libtiledbsoma 48.18% <0.00%> (-0.51%) ⬇️

@johnkerl johnkerl force-pushed the kerl/arrow-schema-bytes branch from 45cbcb4 to 0b06695 Compare March 22, 2024 18:21
@johnkerl johnkerl requested a review from ihnorton March 22, 2024 20:18
@johnkerl johnkerl marked this pull request as ready for review March 22, 2024 20:19
@@ -377,6 +383,8 @@ std::string_view ArrowAdapter::to_arrow_format(
tiledb_datatype_t datatype, bool use_large) {
switch (datatype) {
case TILEDB_STRING_ASCII:
return use_large ? "Z" : "z"; // large because TileDB
Member Author

I seem to recall that STRING_ASCII along with some other flag, metadata something-something, is used to differentiate between string-as-bytes and string-as-string ... @nguyenv / @ihnorton perhaps you recall how TileDB-Py does this (as it successfully does, in the code snippet included in the description field of this PR).
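
To make the format characters in the hunk above easier to read: in the Arrow C Data Interface, `z` and `Z` denote binary and large binary, while `u` and `U` denote UTF-8 string and large UTF-8 string. The sketch below is a Python illustration of that small/large choice only. The function `arrow_format` and the table `FORMAT_NAMES` are hypothetical names, and this is not the `ArrowAdapter::to_arrow_format` implementation.

```python
# Hypothetical illustration; not the libtiledbsoma implementation.
# Maps Arrow C Data Interface format strings to the type names that
# pyarrow prints in the schemas shown in this thread.
FORMAT_NAMES = {
    "z": "binary",
    "Z": "large_binary",
    "u": "string",
    "U": "large_string",
}

# (small, large) format-string pairs per logical kind.
_FORMAT_PAIRS = {
    "binary": ("z", "Z"),
    "string": ("u", "U"),
}


def arrow_format(kind: str, use_large: bool) -> str:
    """Pick the 32-bit-offset or 64-bit-offset format string for a kind."""
    small, large = _FORMAT_PAIRS[kind]
    return large if use_large else small


if __name__ == "__main__":
    for kind in ("binary", "string"):
        for use_large in (False, True):
            fmt = arrow_format(kind, use_large)
            print(kind, use_large, fmt, FORMAT_NAMES[fmt])
```

Read this way, the `"Z"`/`"z"` pair returned for `TILEDB_STRING_ASCII` in the hunk corresponds to large binary and binary.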

Member

I specially handle the string types for enums like that rather than relying on to_tiledb_format.

Member Author

@nguyenv both of the links you sent me open up with "57 files changed" -- non-navigable

Nonetheless I'll take a look at what you wrote in Slack 🤞

@johnkerl

Thanks @nguyenv for the advice! CI is green, and my interactive use as noted in the description field of this PR also works great. :D

cc @ihnorton @jp-dark or anyone else still online at this late hour for approval; or we can do this Monday.

@johnkerl johnkerl changed the title from "[c++] Fix display of Arrow schema for enum of bytes" to "[c++] Fix display of Arrow schema for enum of bytes datatype" Mar 22, 2024
@ihnorton ihnorton left a comment

Consider adding the script from #2306 to a python repr test? Otherwise LGTM.

@johnkerl

johnkerl commented Mar 24, 2024

Thanks @ihnorton

I was delaying the merge for a unit-test case

It'll be more 'somatic' to test it like this:

    with soma.open(uri) as sdf:

        f = sdf.schema.field("int_cat")
        assert f.type.index_type == pa.int8()
        assert f.type.value_type == pa.int64()


        f = sdf.schema.field("str_cat")
        assert f.type.index_type == pa.int8()
        assert f.type.value_type == pa.string()


        f = sdf.schema.field("byte_cat")
        assert f.type.index_type == pa.int8()
        assert f.type.value_type == pa.binary()

with data as ☝️ and I'll get that in before merging

@johnkerl

johnkerl commented Mar 24, 2024

@nguyenv this is still not right :(

When I said this above:

Thanks @nguyenv for the advice! CI is green, and my interactive use as noted in the description field of this PR also works great. :D

I was misreading my own output :(

In the original repro script I was using, I had confused myself by putting the attrs in this order:

soma_joinid: int64 not null
int: int64 not null
int_cat: dictionary<values=int64, indices=int8, ordered=0> not null
str: large_string not null
str_cat: dictionary<values=string, indices=int8, ordered=0> not null
byte: large_binary not null
byte_cat: dictionary<values=string, indices=int8, ordered=0> not null

A less error-prone ordering is here: https://gist.github.com/johnkerl/6c4a8ee3ceacaaf2e1929873095b2933

And we still have:

import tiledbsoma as soma
sdf = soma.open('test_dataframe')
print(sdf.schema)

outputting

soma_joinid: int64 not null
int: int64 not null
str: large_string not null
byte: large_binary not null
int_cat: dictionary<values=int64, indices=int8, ordered=0> not null
str_cat: dictionary<values=string, indices=int8, ordered=0> not null
byte_cat: dictionary<values=string, indices=int8, ordered=0> not null <---- INCORRECT

even though TileDB-Py reports correctly, with

#!/usr/bin/env python

import os
import sys
import tiledb

A = tiledb.open('test_dataframe')

for i in range(A.schema.nattr):
    attr = A.schema.attr(i)
    try:
        index_type = attr.dtype
        value_type = A.enum(attr.name).dtype
        print(f"enum name={attr.name} index_type={index_type.name} value_type={value_type.name}")
    except tiledb.cc.TileDBError:
        pass # not an enum attr

outputting

enum name=int_cat index_type=int8 value_type=int64
enum name=str_cat index_type=int8 value_type=str32
enum name=byte_cat index_type=int8 value_type=bytes <---- CORRECT
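
Since the misreading above came from scanning printed schema lines by eye, one option is to look fields up by name instead. The sketch below parses schema text of the form printed in this thread using only the standard library. The helper `parse_schema_text` is hypothetical and assumes the exact `dictionary<values=..., indices=..., ordered=...>` layout shown above; with the real objects, the `sdf.schema.field(...)` checks quoted earlier are the more direct route.

```python
import re

# Matches the dictionary type text as printed in this thread, e.g.
# "dictionary<values=binary, indices=int8, ordered=0>".
_DICT_RE = re.compile(r"dictionary<values=(\w+), indices=(\w+), ordered=(\d)>")


def parse_schema_text(text: str) -> dict:
    """Hypothetical helper: map each field name to its printed type info."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(": ")
        if not sep or not rest.strip():
            continue  # not a "name: type" line
        match = _DICT_RE.match(rest)
        if match:
            fields[name.strip()] = {
                "values": match.group(1),
                "indices": match.group(2),
            }
        else:
            # Plain type: keep the first token and drop "not null" etc.
            fields[name.strip()] = {"type": rest.split()[0]}
    return fields


if __name__ == "__main__":
    sample = (
        "byte: large_binary not null\n"
        "byte_cat: dictionary<values=string, indices=int8, ordered=0> not null\n"
    )
    print(parse_schema_text(sample)["byte_cat"]["values"])
```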

@johnkerl johnkerl requested a review from ihnorton March 24, 2024 17:52
@nguyenv

nguyenv commented Mar 24, 2024

OK I will take a look tomorrow.

@johnkerl

At present with R and this PR we have:

> sdf <- SOMADataFrameOpen('test_dataframe')
> sdf$schema()
Error: Unsupported data type: CHAR

so I'll add an R unit-test case here as well

@johnkerl

Re my

so I'll add an R unit-test case here as well

we need first to fix R CI failures

@johnkerl

All current CI is green. There are still a few things wrong I'm seeing interactively for which we had inadequate unit-test coverage. More commits coming soon.

@eddelbuettel

Keep in mind that these arrow pieces will likely benefit from / change under / get simpler with the nanoarrow refactoring. However, I am currently blocked on that one, as I see a lot of Python CI shrapnel (though only from IIRC three test files; the majority passes) that I cannot make real progress on ... as I am a complete neophyte when it comes to Python / C(++) interop.

@johnkerl

Good commits on this PR; all green; further work tracked on #2324.

@johnkerl johnkerl merged commit dfa1de9 into main Mar 26, 2024
15 checks passed
@johnkerl johnkerl deleted the kerl/arrow-schema-bytes branch March 26, 2024 18:13
github-actions bot pushed a commit that referenced this pull request Mar 26, 2024
* [c++] Fix display of Arrow schema for enum of bytes

* bugfix per @nguyenv

* more of same

* try 3 with help of @nguyenv

* python unit-test case

* extend the unit-test case a bit

* implement @nguyenv advice

* Partial reversal of schema format assignment

* test

---------

Co-authored-by: Dirk Eddelbuettel <[email protected]>
johnkerl added a commit that referenced this pull request Mar 26, 2024
johnkerl added a commit that referenced this pull request Mar 26, 2024
#2325)

* [c++] Fix display of Arrow schema for enum of bytes

* bugfix per @nguyenv

* more of same

* try 3 with help of @nguyenv

* python unit-test case

* extend the unit-test case a bit

* implement @nguyenv advice

* Partial reversal of schema format assignment

* test

---------

Co-authored-by: John Kerl <[email protected]>
Co-authored-by: Dirk Eddelbuettel <[email protected]>

This pull request has been linked to Shortcut Story #43673: tiledbsoma 1.9.0.

johnkerl added a commit that referenced this pull request Mar 26, 2024
github-actions bot pushed a commit that referenced this pull request Mar 26, 2024
johnkerl added a commit that referenced this pull request Mar 26, 2024
github-actions bot pushed a commit that referenced this pull request Mar 26, 2024
johnkerl added a commit that referenced this pull request Mar 26, 2024