Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[python] Support return_arrow for various queries #256

Merged
merged 2 commits into from
Aug 26, 2022
Merged

Conversation

johnkerl
Copy link
Member

@johnkerl johnkerl commented Aug 24, 2022

Note that ingest-from-arrow is a separate issue and is not covered on this PR.

@johnkerl johnkerl marked this pull request as draft August 24, 2022 17:20
@johnkerl johnkerl force-pushed the kerl/arrow-external branch 6 times, most recently from 4597a0d to 35936f8 Compare August 25, 2022 00:49
@johnkerl johnkerl changed the title Support return_arrow for various queries [WIP] Support return_arrow for various queries Aug 25, 2022
@johnkerl johnkerl marked this pull request as ready for review August 25, 2022 00:50
else:
return (array_number, False, None)

def _ascii_to_unicode_arrow_readback(self, df: pa.Table) -> pa.Table:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the purpose of this function? This makes a full copy of fields and then creates a new pyarrow table from python arrays. Why do we need to perform a copy?

Copy link
Member Author

@johnkerl johnkerl Aug 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Shelnutt2 it's a separate function so as to be called from the ThreadPoolExecutor -- I'm open to feedback if you have a better way.

Or if you're asking why copy at all -- please notice that it does a copy+cast only when the need to do ASCII-to-Unicode. (It's careful to return None if nothing needs conversion.) And this conversion is crucial -- not optional -- since without this user-level queries like 'cell_type == "myeloid cell"' would return nothing since columns would have values like b"myeloid cell" which is != to "myeloid cell".

Please see

We had extensive discussions on this a couple months ago. We must (can't not) do Unicode-to-ASCII converstions on writes and ASCII-to-Unicode conversions on reads until such time as we have core mods in place which:

  • Permit string attrs to be Unicode and be queryable using query conditions
  • Permit string dims to be Unicode at all

cc @ihnorton

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be followed up separately.

@johnkerl johnkerl requested a review from Shelnutt2 August 25, 2022 19:00
@johnkerl johnkerl force-pushed the kerl/arrow-external branch from 35936f8 to a2b8cc8 Compare August 26, 2022 13:06
@johnkerl johnkerl merged commit 879e341 into main Aug 26, 2022
@johnkerl johnkerl deleted the kerl/arrow-external branch August 26, 2022 13:16
@johnkerl johnkerl changed the title Support return_arrow for various queries [python] Support return_arrow for various queries Oct 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants