Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nullable string columns #679

Closed
ivirshup opened this issue Jan 11, 2022 · 15 comments · Fixed by #1558
Closed

Nullable string columns #679

ivirshup opened this issue Jan 11, 2022 · 15 comments · Fixed by #1558
Assignees
Milestone

Comments

@ivirshup
Copy link
Member

Split off from #504

It would be nice to have support for nullable string arrays. It would be good to have a consistent in-memory representation for these so we can reason about performance. However, this does not currently exist in our dependency stack. I currently think this feature will be dependent on upstream developments in pandas StringArray type.

This is less urgent than nullable integers and booleans since we already have nullable categorical arrays, and currently aggressively cast strings to categorical for performance reasons.

@ivirshup
Copy link
Member Author

ivirshup commented Sep 16, 2022

Pandas now has multiple ways of doing strings, custom and via arrow.

I think we could still handle both these cases by just storing a mask. This would just be a little inefficient, but we can always update.

Maybe we could even handle arrow bit masks if those seem to be the path forward (docs for bit masks, docs for np.packbits)

@pcm32
Copy link

pcm32 commented Dec 1, 2022

This is also relevant for the output of sc.tl.filter_rank_genes_groups in scanpy, which makes some of the genes in the newly created uns part nan:

adata.uns['rank_genes_groups_filtered']['names'][0]
(nan, nan, 'NKG7', nan, nan, 'PPBP')

I'm adding it here since this seems to be related to the issue of h5py > 3.0 not being happy with casting non strings to strings:

TypeError: Can't implicitly convert non-string objects to strings

Above error raised while writing key 'names' of <class 'h5py._hl.group.Group'> to /

Let me know if you prefer me to open a new issue on Scanpy.

@ivirshup
Copy link
Member Author

ivirshup commented Dec 1, 2022

This issue wouldn't apply for rank genes groups because that object is a record array, while this issue addresses dataframe columns specifically.

@pcm32
Copy link

pcm32 commented Dec 1, 2022

No problem, should I open a new one here or on scanpy?

@ivirshup
Copy link
Member Author

@pcm32, sorry for the late response here. Came at a busy time of year.

I believe there will already be issues open on scanpy for this.

@ivirshup
Copy link
Member Author

ivirshup commented Feb 28, 2023

About implementation for nullable string support:

This is somewhat complicated by pandas having multiple backends for nullable string arrays (pyarrow and pd.StringDtype).

We probably want to go with an on disk representation of arrays similar to the arrow in memory representation, but it seems configurable in pandas whether we get the pyarrow representation or the pandas rep. We also don't want to add a hard dependency on pyarrow.

I'm also not sure how we can go from the pandas representation to something writable. We can easily get the masks (.isna()), but I don't know what to handle the "data" containing pd.NA values. I think we probably just make a copy of the array and fill replace the NA entries with some other string, but idk that there's a great choice here.

@ivirshup ivirshup added this to the 0.9 milestone Mar 1, 2023
@ivirshup
Copy link
Member Author

ivirshup commented Mar 1, 2023

This is starting to come up more frequently, and will likely be even more of an issue with the next release of pandas (which is coming soon).

To head that off, I think I'm going to add this feature for 0.9

@nroak
Copy link

nroak commented Mar 8, 2023

Thank you for this update, is there an estimated timeline for this issue to be patched? I'm facing it exactly where you explained- due to nan in ranked gene lists.

@flying-sheep
Copy link
Member

OK, to be clear: This issue means support for pandas.core.arrays.string_.StringArray as described in #963, right?

@MishalAshraf
Copy link

had a similar issue when trying to concatenate objects. one dataset has a boolean obs column, the other does not. when they are combined it becomes True/False/NaN. I agree that it's up to the user to determine and explicitly define how they want to handle this.

For me, this is 1 step in a longer pipeline for data exploration. It would be nice to have a lazy save here where I concatenate multiple datasets without explicitly resolving which columns will be informative later on.

@flying-sheep flying-sheep modified the milestones: 0.9, 0.10.0 Jul 21, 2023
@flying-sheep
Copy link
Member

milestone was missed, bumping to 0.10.0

@jacksonloper
Copy link

So basically the current best-practice workaround is to cast to Categorical, is that right?

@flying-sheep flying-sheep self-assigned this Jul 2, 2024
@flying-sheep
Copy link
Member

flying-sheep commented Jul 9, 2024

@ilan-gold
Copy link
Contributor

Potential solution (also going forward): use settings!

@flying-sheep
Copy link
Member

To expand on it: We want opt-in writing so people don’t accidentally write files that can’t be read by most people.

So we’d add a setting that enables writing pd.StringDtype series, and if one of them is encountered without the setting being active, we throw an error that explains how to enable it and why it’s disabled by default.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants