Nullable string columns #679
Pandas now has multiple string implementations: its own custom one and one backed by Arrow. I think we could still handle both cases by just storing a mask. This would be a little inefficient, but we can always update it later. Maybe we could even handle Arrow bit masks if those seem to be the path forward (docs for bit masks, docs for np.packbits).
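To make the two mask layouts concrete, here is a minimal sketch (values are invented for illustration) of a per-element byte mask versus an Arrow-style packed bit mask built with `np.packbits`:

```python
import numpy as np

# Hypothetical string column with missing entries; "" is a placeholder value.
values = np.array(["a", "", "c", ""], dtype=object)
mask = np.array([False, True, False, True])  # True marks a missing entry

# A byte mask stores one byte per element. Arrow-style validity bitmaps instead
# pack eight flags per byte, with 1 = valid and the first element in the
# least-significant bit.
bitmask = np.packbits(~mask, bitorder="little")

# Round-trip back to a boolean validity array.
unpacked = np.unpackbits(bitmask, count=len(mask), bitorder="little").astype(bool)
assert (unpacked == ~mask).all()
```

The bit mask is 8x smaller, at the cost of pack/unpack steps on read and write.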
This is also relevant for the output of sc.tl.filter_rank_genes_groups in scanpy, which sets some of the gene names in the newly created uns entry to NaN.
I'm adding it here since this seems to be related to h5py > 3.0 not being happy with casting non-strings to strings:
Let me know if you prefer me to open a new issue on Scanpy.
This issue wouldn't apply to rank genes groups, because that object is a record array, while this issue addresses dataframe columns specifically.
No problem; should I open a new one here or on scanpy?
@pcm32, sorry for the late response here. This came at a busy time of year. I believe there will already be issues open on scanpy for this.
About the implementation of nullable string support: this is somewhat complicated by pandas having multiple backends for nullable string arrays (pyarrow and pd.StringDtype). We probably want an on-disk representation similar to Arrow's in-memory representation, but whether we get the pyarrow representation or the pandas one appears to be configurable in pandas. We also don't want to add a hard dependency on pyarrow. I'm also not sure how we can go from the pandas representation to something writable. We can easily get the masks (…
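One possible route from the pandas representation to something writable, sketched under the assumption that a values array plus a missing-value mask is an acceptable on-disk form:

```python
import pandas as pd

# Nullable string array using pandas' own (non-pyarrow) backend.
s = pd.array(["a", pd.NA, "c"], dtype="string")

# Boolean mask of missing entries.
mask = s.isna()

# A writable object array, with "" standing in for missing values; the mask
# preserves which entries were actually missing.
values = s.to_numpy(dtype=object, na_value="")
```

This avoids a pyarrow dependency, since `isna` and `to_numpy` work regardless of which string backend produced the array.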
This is starting to come up more frequently, and will likely be even more of an issue with the next release of pandas (which is coming soon). To head that off, I think I'm going to add this feature for 0.9 |
Thank you for this update. Is there an estimated timeline for this issue to be patched? I'm facing it exactly where you explained: due to NaN in ranked gene lists.
OK, to be clear: this issue means support for …
I had a similar issue when trying to concatenate objects: one dataset has a boolean obs column, the other does not, and when they are combined it becomes True/False/NaN. I agree that it's up to the user to determine and explicitly define how they want to handle this. For me, this is one step in a longer pipeline for data exploration, so it would be nice to have a lazy save here, where I can concatenate multiple datasets without explicitly resolving which columns will be informative until later.
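A small sketch of that concatenation pitfall in plain pandas (column names invented for illustration), along with one explicit resolution via the nullable boolean dtype:

```python
import pandas as pd

# One frame has a boolean column, the other does not.
a = pd.DataFrame({"flag": [True, False]})
b = pd.DataFrame({"other": [1, 2]})

# The missing rows are filled with NaN, so "flag" degrades to object dtype
# holding a mix of True/False/NaN.
combined = pd.concat([a, b])

# Explicitly resolving the column: cast to pandas' nullable boolean dtype,
# which keeps the missing entries as pd.NA.
resolved = combined["flag"].astype("boolean")
```

Until the writer handles such columns, casting like this (or to Categorical) is the explicit resolution step.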
Milestone was missed; bumping to 0.10.0.
So basically the current best-practice workaround is to cast to Categorical, is that right? |
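The workaround being asked about can be sketched in a few lines (column contents invented for illustration); Categorical keeps NaN as missing rather than turning it into a category:

```python
import numpy as np
import pandas as pd

# A string column with a missing value, as produced e.g. by concatenation.
col = pd.Series(["wt", np.nan, "ko"])

# Cast to Categorical before writing; NaN stays missing, not a category.
cat = col.astype("category")
assert int(cat.isna().sum()) == 1
```

Since nullable categorical columns are already supported on disk, this round-trips cleanly.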
Pandas seems to be getting a dedicated string array type (the vote is pretty much in favor), but it's not based on the nullable string array in NumPy 2 🤦
nanoarrow can't currently write, but we can just use our …
Potential solution (also going forward): use …
To expand on that: we want opt-in writing so people don't accidentally write files that can't be read by most people. So we'd add a setting that enables writing …
Split off from #504
It would be nice to have support for nullable string arrays. It would be good to have a consistent in-memory representation for these so we can reason about performance; however, this does not currently exist in our dependency stack. I currently think this feature will depend on upstream developments in pandas' StringArray type.
This is less urgent than nullable integers and booleans, since we already have nullable categorical arrays and currently cast strings to categorical aggressively for performance reasons.
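For reference, the two in-memory representations discussed here, side by side: pandas' nullable string dtype (not yet writable by anndata at the time of this issue) versus the Categorical dtype that strings are currently cast to. Both preserve missing values:

```python
import pandas as pd

# Nullable string array: pd.NA marks the missing entry.
s = pd.Series(["a", pd.NA, "b"], dtype="string")

# Categorical, the representation anndata currently casts strings to;
# None stays missing rather than becoming a category.
c = pd.Series(["a", None, "b"]).astype("category")
```

The missing-value semantics match, which is why the categorical cast works as a stopgap while string columns wait on this feature.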