Nullable string columns #679
Pandas now has multiple string implementations: its own custom one and one backed by Arrow. I think we could still handle both cases by just storing a mask. This would be a little inefficient, but we can always update it later. Maybe we could even handle Arrow bit masks if those seem to be the path forward (docs for bit masks, docs for np.packbits).
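To make the two mask layouts concrete, here is a minimal sketch (values are invented for illustration) of a per-element byte mask versus an Arrow-style packed bit mask built with `np.packbits`:

```python
import numpy as np

# Hypothetical string column with missing entries; "" is a placeholder value.
values = np.array(["a", "", "c", ""], dtype=object)
mask = np.array([False, True, False, True])  # True marks a missing entry

# A byte mask stores one byte per element. Arrow-style validity bitmaps instead
# pack eight flags per byte, with 1 = valid and the first element in the
# least-significant bit.
bitmask = np.packbits(~mask, bitorder="little")

# Round-trip back to a boolean validity array.
unpacked = np.unpackbits(bitmask, count=len(mask), bitorder="little").astype(bool)
assert (unpacked == ~mask).all()
```

The bit mask is 8x smaller, at the cost of pack/unpack steps on read and write.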
This is also relevant for the output of sc.tl.filter_rank_genes_groups in scanpy, which sets some of the gene names in the newly created uns entry to NaN.
I'm adding it here since this seems to be related to h5py > 3.0 not being happy with casting non-strings to strings:
Let me know if you prefer me to open a new issue on Scanpy.
This issue wouldn't apply to rank genes groups, because that object is a record array, while this issue addresses dataframe columns specifically.
No problem; should I open a new one here or on scanpy?
@pcm32, sorry for the late response here. This came at a busy time of year. I believe there will already be issues open on scanpy for this.
About the implementation of nullable string support: this is somewhat complicated by pandas having multiple backends for nullable string arrays (pyarrow and pd.StringDtype). We probably want an on-disk representation similar to Arrow's in-memory representation, but whether we get the pyarrow representation or the pandas one appears to be configurable in pandas. We also don't want to add a hard dependency on pyarrow. I'm also not sure how we can go from the pandas representation to something writable. We can easily get the masks (…
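One possible route from the pandas representation to something writable, sketched under the assumption that a values array plus a missing-value mask is an acceptable on-disk form:

```python
import pandas as pd

# Nullable string array using pandas' own (non-pyarrow) backend.
s = pd.array(["a", pd.NA, "c"], dtype="string")

# Boolean mask of missing entries.
mask = s.isna()

# A writable object array, with "" standing in for missing values; the mask
# preserves which entries were actually missing.
values = s.to_numpy(dtype=object, na_value="")
```

This avoids a pyarrow dependency, since `isna` and `to_numpy` work regardless of which string backend produced the array.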
This is starting to come up more frequently, and will likely be even more of an issue with the next release of pandas (which is coming soon). To head that off, I think I'm going to add this feature for 0.9 |
Thank you for this update. Is there an estimated timeline for this issue to be patched? I'm facing it exactly where you explained: due to NaN in ranked gene lists.
OK, to be clear: this issue means support for …
I had a similar issue when trying to concatenate objects: one dataset has a boolean obs column, the other does not, and when they are combined it becomes True/False/NaN. I agree that it's up to the user to determine and explicitly define how they want to handle this. For me, this is one step in a longer pipeline for data exploration, so it would be nice to have a lazy save here, where I can concatenate multiple datasets without explicitly resolving which columns will be informative until later.
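A small sketch of that concatenation pitfall in plain pandas (column names invented for illustration), along with one explicit resolution via the nullable boolean dtype:

```python
import pandas as pd

# One frame has a boolean column, the other does not.
a = pd.DataFrame({"flag": [True, False]})
b = pd.DataFrame({"other": [1, 2]})

# The missing rows are filled with NaN, so "flag" degrades to object dtype
# holding a mix of True/False/NaN.
combined = pd.concat([a, b])

# Explicitly resolving the column: cast to pandas' nullable boolean dtype,
# which keeps the missing entries as pd.NA.
resolved = combined["flag"].astype("boolean")
```

Until the writer handles such columns, casting like this (or to Categorical) is the explicit resolution step.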
Milestone was missed; bumping to 0.10.0.
So basically the current best-practice workaround is to cast to Categorical, is that right? |
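The workaround being asked about can be sketched in a few lines (column contents invented for illustration); Categorical keeps NaN as missing rather than turning it into a category:

```python
import numpy as np
import pandas as pd

# A string column with a missing value, as produced e.g. by concatenation.
col = pd.Series(["wt", np.nan, "ko"])

# Cast to Categorical before writing; NaN stays missing, not a category.
cat = col.astype("category")
assert int(cat.isna().sum()) == 1
```

Since nullable categorical columns are already supported on disk, this round-trips cleanly.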
Pandas seems to be getting a dedicated string array type (the vote is pretty much in favor), but it's not based on the nullable string array in NumPy 2 🤦
nanoarrow can't currently write, but we can just use our …
Potential solution (also going forward): use …
To expand on that: we want opt-in writing so people don't accidentally write files that can't be read by most people. So we'd add a setting that enables writing …
Split off from #504
It would be nice to have support for nullable string arrays. It would be good to have a consistent in-memory representation for these so we can reason about performance; however, this does not currently exist in our dependency stack. I currently think this feature will depend on upstream developments in pandas' StringArray type.
This is less urgent than nullable integers and booleans, since we already have nullable categorical arrays and currently cast strings to categorical aggressively for performance reasons.
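For reference, the two in-memory representations discussed here, side by side: pandas' nullable string dtype (not yet writable by anndata at the time of this issue) versus the Categorical dtype that strings are currently cast to. Both preserve missing values:

```python
import pandas as pd

# Nullable string array: pd.NA marks the missing entry.
s = pd.Series(["a", pd.NA, "b"], dtype="string")

# Categorical, the representation anndata currently casts strings to;
# None stays missing rather than becoming a category.
c = pd.Series(["a", None, "b"]).astype("category")
```

The missing-value semantics match, which is why the categorical cast works as a stopgap while string columns wait on this feature.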