[python] Improve schema-printer for enumerated types #1673
Merged
johnkerl
merged 3 commits into
viviannguyen/sc-30316/enumerated-data-types-aka-categoricals-aka
from
kerl/enums-schema-print
Sep 14, 2023
johnkerl
changed the base branch from
main
to
viviannguyen/sc-30316/enumerated-data-types-aka-categoricals-aka
September 13, 2023 18:24
CI is failing
johnkerl
changed the title
[python] Improve schema-printer [RFC]
[python] Improve schema-printer
Sep 13, 2023
johnkerl
changed the title
[python] Improve schema-printer
[python] Improve schema-printer for enumerated types
Sep 13, 2023
nguyenv
approved these changes
Sep 14, 2023
Tested locally and works.
johnkerl
merged commit
549d3e7
into
viviannguyen/sc-30316/enumerated-data-types-aka-categoricals-aka
Sep 14, 2023
nguyenv
pushed a commit
that referenced
this pull request
Sep 14, 2023
* [python] Improve schema-printer [RFC]
* neaten
* code-neaten inspired by dirk's 1675
johnkerl
added a commit
that referenced
this pull request
Sep 15, 2023
* [c++] Support `Enumeration` in C++ Codebase
* Addition of `SOMAArray::get_enum` and `SOMAArray::get_enum_label_on_attr`
* Attach an enumeration/dictionary to the `ColumnBuffer` if applicable; this is used when converting from `ArrayBuffers` to Arrow Tables in the Python and R APIs
* Add `get_attr_to_enum_mapping` Function
* Add Unit Tests for Enumeration in C++
* `to_varlen_buffers` Returns `std::string`
* Prior to TileDB-Inc/TileDB#4272, the SOMA unit tests were erroneously writing a byte vector for string dimensions which maps to `TILEDB_BLOB` rather than `TILEDB_STRING_ASCII`
* WIP support enumeration for schema
* [python][wip][nomerge] Support Enumerations in Python (writes)
* Run pre-commit hook
* [python] Expand unit-testing for enumerated types
* used pre-prepared input for categorical-int-nan data
* [python] Support Enumerations On Nullable Attributes and Query Conditions
* [c++] Support `Enumeration` in C++ Codebase
* Addition of `SOMAArray::get_enum` and `SOMAArray::get_enum_label_on_attr`
* Attach an enumeration/dictionary to the `ColumnBuffer` if applicable; this is used when converting from `ArrayBuffers` to Arrow Tables in the Python and R APIs
* Add `get_attr_to_enum_mapping` Function
* Add Unit Tests for Enumeration in C++
* `to_varlen_buffers` Returns `std::string`
* Prior to TileDB-Inc/TileDB#4272, the SOMA unit tests were erroneously writing a byte vector for string dimensions which maps to `TILEDB_BLOB` rather than `TILEDB_STRING_ASCII`
* WIP
* Do Not Index Chunk For Empty Arrow Column
* fix `TypeError: Object of type bool_ is not JSON serializable`
* Use TileDB-Py 0.22.3
* Use Dict For Typing
* Update Typing
* Use typed ndarray
* More Typing Corrections
* Recomment tiledb-py dep
* check ifattr exists
* sandbox update
* Use tiledb-py dep; typing
* [c++] Support `Enumeration` in C++ Codebase
* Addition of `SOMAArray::get_enum` and `SOMAArray::get_enum_label_on_attr`
* Attach an enumeration/dictionary to the `ColumnBuffer` if applicable; this is used when converting from `ArrayBuffers` to Arrow Tables in the Python and R APIs
* Add `get_attr_to_enum_mapping` Function
* Add Unit Tests for Enumeration in C++
* `to_varlen_buffers` Returns `std::string`
* Prior to TileDB-Inc/TileDB#4272, the SOMA unit tests were erroneously writing a byte vector for string dimensions which maps to `TILEDB_BLOB` rather than `TILEDB_STRING_ASCII`
* WIP support enumeration for schema
* [python][wip][nomerge] Support Enumerations in Python (writes)
* Run pre-commit hook
* [python] Expand unit-testing for enumerated types
* used pre-prepared input for categorical-int-nan data
* [python] Support Enumerations On Nullable Attributes and Query Conditions
* WIP
* Do Not Index Chunk For Empty Arrow Column
* fix `TypeError: Object of type bool_ is not JSON serializable`
* Use Dict For Typing
* Update Typing
* Use typed ndarray
* More Typing Corrections
* Recomment tiledb-py dep
* Use tiledb-py dep; typing
* [python] Leverage bounding-box feature for obsm/varm outgest robustness (#1650)
* temp
* robustness
* extract method for obsm/varm outgest
* complete rebase to main
* more unit-test cases
* remove R debugs
* robustness
* complete rebase to main
* [python] Leverage bounding-box feature for obsm/varm outgest robustness
* test data for holey obsm
* unit-test cases
* on-line help improvements
* [python] Update default-filter-list handling in unit tests (#1676)
* [python] Improve schema-printer for enumerated types (#1673)
* [python] Improve schema-printer [RFC]
* neaten
* code-neaten inspired by dirk's 1675
* 2.17.0
* fix merge
* merge
* pre-commit
---------
Co-authored-by: John Kerl <[email protected]>
Co-authored-by: John Kerl <[email protected]>
Issue and/or context: Context: #866
When we have an enumerated type, reading the whole array as an Arrow table and asking that Arrow table for its schema is correct:
but if we only ask the `SOMADataFrame` for its schema in Arrow form, it is not:
On this PR we have `exp.obs.schema` print the correct result in Python.
Changes:
Notes for Reviewer:
In 2.17 some enum info is in the schema and some is in the array. So @thetorpedodog I would value your insight regarding the open-at-timestamp feature you were closely involved in.
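To make the mismatch in the description concrete, here is a minimal pure-Python sketch of the idea, not the code in this PR. It assumes that an enumerated column is stored as integer indices while the label type lives in separate enumeration metadata, so a printer that only looks at the stored type reports the index type. The names `Attribute`, `Enumeration`, `describe`, and `format_schema` are hypothetical and are not tiledbsoma or Arrow APIs; the output string only imitates the way Arrow renders dictionary types.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Attribute:
    """Hypothetical stand-in for one stored column."""
    name: str
    storage_type: str                # type of the values physically stored, e.g. "int8"
    enum_name: Optional[str] = None  # set only when the column is enumerated


@dataclass
class Enumeration:
    """Hypothetical stand-in for the enumeration metadata of a column."""
    value_type: str        # type of the labels, e.g. "string"
    ordered: bool = False


def describe(attr: Attribute, enums: Dict[str, Enumeration], use_enums: bool = True) -> str:
    """Render one column, consulting the enumeration metadata when asked to."""
    if use_enums and attr.enum_name is not None:
        enum = enums[attr.enum_name]
        return (
            f"{attr.name}: dictionary<values={enum.value_type}, "
            f"indices={attr.storage_type}, ordered={int(enum.ordered)}>"
        )
    # Without the enumeration lookup, only the index type is visible.
    return f"{attr.name}: {attr.storage_type}"


def format_schema(attrs: List[Attribute], enums: Dict[str, Enumeration], use_enums: bool = True) -> str:
    """Render every column, one per line."""
    return "\n".join(describe(a, enums, use_enums) for a in attrs)


attrs = [
    Attribute("soma_joinid", "int64"),
    Attribute("cell_type", "int8", enum_name="cell_type"),
]
enums = {"cell_type": Enumeration("string")}

print(format_schema(attrs, enums, use_enums=False))  # enumerated column shows as int8
print(format_schema(attrs, enums))                   # enumerated column shows as dictionary<...>
```

With `use_enums=False` the `cell_type` column is reported by its index type alone, which is the kind of incomplete result described above; with the lookup enabled it is reported as a dictionary type.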