[python] Improve schema-printer for enumerated types #1673

@johnkerl johnkerl commented Sep 13, 2023

Issue and/or context: #866

When we have an enumerated type, reading the whole array as an Arrow table and asking that Arrow table for its schema is correct:

>>> exp.obs.read().concat().schema
soma_joinid: int64
obs_id: large_string
n_genes: int64
percent_mito: float
n_counts: float
louvain: dictionary<values=string, indices=int8, ordered=0>

but if we only ask the SOMADataFrame for its schema in Arrow form, it is not:

>>> exp.obs.schema
soma_joinid: int64
obs_id: large_string
n_genes: int64
percent_mito: float
n_counts: float
louvain: int8

With this PR, exp.obs.schema reports the dictionary type for enumerated columns such as louvain in Python as well.

Changes:

Notes for Reviewer:

In 2.17 some enum info is in the schema and some is in the array. So @thetorpedodog I would value your insight regarding the open-at-timestamp feature you were closely involved in.

@johnkerl johnkerl changed the base branch from main to viviannguyen/sc-30316/enumerated-data-types-aka-categoricals-aka September 13, 2023 18:24
@johnkerl

CI is failing because core 2.16.3 is being found:
https://github.com/single-cell-data/TileDB-SOMA/actions/runs/6176350855/job/16765107020?pr=1673

tiledbsoma.__version__    1.3.0.post141.dev2997816183
tiledb.version()          0.22.3
core version              2.16.3
anndata.__version__  (ad) 0.9.2
numpy.__version__    (np) 1.23.5
pandas.__version__   (pd) 2.1.0
pyarrow.__version__  (pa) 13.0.0
scanpy.__version__   (sc) 1.9.5
scipy.__version__    (sp) 1.11.2
python__version__         3.10.13
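
A version mismatch like the one above can be caught early with a small guard. The following is only a sketch: the helper names `parse_version` and `meets_minimum` are hypothetical, the version string is passed in by hand rather than read from any library, and the 2.17.0 minimum is an assumption based on the note above that the enum information is handled in 2.17.

```python
def parse_version(text):
    """Turn a dotted version string such as '2.16.3' into a tuple of ints."""
    return tuple(int(part) for part in text.split(".")[:3])


def meets_minimum(found, minimum):
    """Return True if the dotted version string `found` is at least `minimum`."""
    return parse_version(found) >= minimum


# Assumed minimum core version for enumeration support (see the note above).
MINIMUM_CORE = (2, 17, 0)

for found in ("2.16.3", "2.17.0"):
    status = "ok" if meets_minimum(found, MINIMUM_CORE) else "too old"
    print(f"core {found}: {status}")
```

Comparing integer tuples rather than raw strings avoids mistakes such as treating "2.9.0" as newer than "2.17.0".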

@johnkerl johnkerl changed the title [python] Improve schema-printer [RFC] [python] Improve schema-printer Sep 13, 2023
@johnkerl johnkerl changed the title [python] Improve schema-printer [python] Improve schema-printer for enumerated types Sep 13, 2023
@nguyenv nguyenv left a comment

Tested locally and works.

@johnkerl johnkerl marked this pull request as ready for review September 14, 2023 14:47
@johnkerl johnkerl merged commit 549d3e7 into viviannguyen/sc-30316/enumerated-data-types-aka-categoricals-aka Sep 14, 2023
@johnkerl johnkerl deleted the kerl/enums-schema-print branch September 14, 2023 14:48
nguyenv pushed a commit that referenced this pull request Sep 14, 2023
* [python] Improve schema-printer [RFC]

* neaten

* code-neaten inspired by dirk's 1675
johnkerl added a commit that referenced this pull request Sep 15, 2023
* [c++] Support `Enumeration` in C++ Codebase

* Addition of `SOMAArray::get_enum` and `SOMAArray::get_enum_label_on_attr`
* Attach an enumeration/dictionary to the `ColumnBuffer` if applicable;
  this is used when converting from `ArrayBuffers` to Arrow Tables in
  the Python and R APIs

* Add `get_attr_to_enum_mapping` Function

* Add Unit Tests for Enumeration in C++

* `to_varlen_buffers` Returns `std::string`

* Prior to TileDB-Inc/TileDB#4272, the SOMA unit
tests were erroneously writing a byte vector for string dimensions which
maps to `TILEDB_BLOB` rather than `TILEDB_STRING_ASCII`

* WIP support enumeration for schema

* [python][wip][nomerge] Support Enumerations in Python (writes)

* Run pre-commit hook

* [python] Expand unit-testing for enumerated types

* used pre-prepared input for categorical-int-nan data

* [python] Support Enumerations On Nullable Attributes and Query Conditions

* [c++] Support `Enumeration` in C++ Codebase

* Addition of `SOMAArray::get_enum` and `SOMAArray::get_enum_label_on_attr`
* Attach an enumeration/dictionary to the `ColumnBuffer` if applicable;
  this is used when converting from `ArrayBuffers` to Arrow Tables in
  the Python and R APIs

* Add `get_attr_to_enum_mapping` Function

* Add Unit Tests for Enumeration in C++

* `to_varlen_buffers` Returns `std::string`

* Prior to TileDB-Inc/TileDB#4272, the SOMA unit
tests were erroneously writing a byte vector for string dimensions which
maps to `TILEDB_BLOB` rather than `TILEDB_STRING_ASCII`

* WIP

* Do Not Index Chunk For Empty Arrow Column

* fix `TypeError: Object of type bool_ is not JSON serializable`

* Use TileDB-Py 0.22.3

* Use Dict For Typing

* Update Typing

* Use typed ndarray

* More Typing Corrections

* Recomment tiledb-py dep

* check ifattr exists

* sandbox update

* Use tiledb-py dep; typing

* [c++] Support `Enumeration` in C++ Codebase

* Addition of `SOMAArray::get_enum` and `SOMAArray::get_enum_label_on_attr`
* Attach an enumeration/dictionary to the `ColumnBuffer` if applicable;
  this is used when converting from `ArrayBuffers` to Arrow Tables in
  the Python and R APIs

* Add `get_attr_to_enum_mapping` Function

* Add Unit Tests for Enumeration in C++

* `to_varlen_buffers` Returns `std::string`

* Prior to TileDB-Inc/TileDB#4272, the SOMA unit
tests were erroneously writing a byte vector for string dimensions which
maps to `TILEDB_BLOB` rather than `TILEDB_STRING_ASCII`

* WIP support enumeration for schema

* [python][wip][nomerge] Support Enumerations in Python (writes)

* Run pre-commit hook

* [python] Expand unit-testing for enumerated types

* used pre-prepared input for categorical-int-nan data

* [python] Support Enumerations On Nullable Attributes and Query Conditions

* WIP

* Do Not Index Chunk For Empty Arrow Column

* fix `TypeError: Object of type bool_ is not JSON serializable`

* Use Dict For Typing

* Update Typing

* Use typed ndarray

* More Typing Corrections

* Recomment tiledb-py dep

* Use tiledb-py dep; typing

* [python] Leverage bounding-box feature for obsm/varm outgest robustness (#1650)

* temp

* robustness

* extract method for obsm/varm outgest

* complete rebase to main

* more unit-test cases

* remove R debugs

* robustness

* complete rebase to main

* [python] Leverage bounding-box feature for obsm/varm outgest robustness

* test data for holey obsm

* unit-test cases

* on-line help improvements

* [python] Update default-filter-list handling in unit tests (#1676)

* [python] Improve schema-printer for enumerated types (#1673)

* [python] Improve schema-printer [RFC]

* neaten

* code-neaten inspired by dirk's 1675

* 2.17.0

* fix merge

* merge

* pre-commit

---------

Co-authored-by: John Kerl <[email protected]>
Co-authored-by: John Kerl <[email protected]>