-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[python] Use true ASCII attributes in dataframes #359
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This was referenced Oct 3, 2022
johnkerl
force-pushed
the
kerl/ascii
branch
8 times, most recently
from
October 3, 2022 04:16
5820757
to
c89409a
Compare
johnkerl
force-pushed
the
kerl/pyarrow-table
branch
7 times, most recently
from
October 4, 2022 03:10
4fa2285
to
8193c21
Compare
johnkerl
force-pushed
the
kerl/ascii
branch
4 times, most recently
from
October 4, 2022 14:16
973dc7d
to
9215877
Compare
johnkerl
force-pushed
the
kerl/pyarrow-table
branch
from
October 4, 2022 14:43
f260372
to
ae2bde1
Compare
johnkerl
force-pushed
the
kerl/pyarrow-table
branch
from
October 4, 2022 14:45
ae2bde1
to
5ccef0e
Compare
johnkerl
force-pushed
the
kerl/ascii
branch
2 times, most recently
from
October 4, 2022 15:06
0661600
to
681b743
Compare
johnkerl
force-pushed
the
kerl/pyarrow-table
branch
from
October 4, 2022 15:19
5ccef0e
to
9d2a8f4
Compare
johnkerl
force-pushed
the
kerl/ascii
branch
2 times, most recently
from
October 11, 2022 00:18
46b4ae3
to
a8bb90b
Compare
johnkerl
changed the title
Use true ASCII attributes in dataframes [WIP]
Use true ASCII attributes in dataframes
Oct 11, 2022
fyi, now that python unit-test cases are passing, there are C++ unit-test failures which i am currently debugging |
fixed on latest commit: |
Note: tests will fail with
until TileDB-Inc/TileDB-Py#1337 wends its way into a |
gspowley
approved these changes
Oct 11, 2022
This was referenced Oct 12, 2022
Closed
johnkerl
changed the title
Use true ASCII attributes in dataframes
[python] Use true ASCII attributes in dataframes
Oct 26, 2022
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
What
This is a forward-port of #273 from the
main-old
branch to themain
branch.Why
Lack of backward compatibility
main-old
(see [python] Update ASCII storage for dataframes #273) we conditionally retained the decode-on-readback logic which was essential since otherwise b"foo" != "foo"main
(greenfield) branch we do not keep backward compatibility -- any pre-existing datasets with non-ASCII attributes will need to be regeneratedPrerequisites
TileDB-Py
0.17.5 which is here as of [python] Depend on TileDB-Py 0.17.5 #394PR context
This is the second in a group of three related PRs:
read
returnpyarrow.Table
notpyarrow.RecordBatch
(as in an outdated version of that spec)main-old
which will truly have ASCII columns, obviating the need for ourutil_arrow.ascii_to_unicode_pyarrow_readback
read
methods, which will go in cleanly nowpyarrow.Table
and with the first PR our unit tests will be ready to gopyarrow.LargeBinaryArray
(needing decode) but when we are properly writing ASCII cells via the Python write path then the C++ code will read ASCII cells and return them as strings (no longer needing decoding)These three changesets could be done in a single PR, but that would be unmerciful to the reviewers.
Problem
Issue I'm seeing on this PR which confuses me:
Script:
TileDB schema:
Reading back from C++ or from Python:
In summary:
"ascii"
attributeslarge_string
notstring
in the Arrow readback schemacc @gspowley and @nguyenv
@nguyenv I'll get you a simpler reprex tomorrow
Update 2022-10-04: simpler repro https://gist.github.com/johnkerl/278bb246c3aa05cc4dfb1694503744c6