`/` in column names makes AnnData Zarr object unreadable on windows #1447

LucaMarconato · 2024-03-27T15:55:07Z

Please make sure these conditions are met

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of anndata.
(optional) I have confirmed this bug exists on the master branch of anndata.

Report

Reported by a spatialdata Windows user and reproduced on a Windows machine scverse/spatialdata-io#129. I can't reproduce on my macOS machine or on a Linux machine.

When / is part of a name of a var column, the data is written to disk in a subfolder (also in macOS, see screenshot) and can still be read correctly. In Windows the column can't be read, probably because of the difference between / and \ for paths.

In spatialdata, we are considering checking all the element names and their respective element columns (e.g. GeoDataFrame column names, AnnData obs/var/... column names, etc) and allowing only strings with alphanumeric or the '-_.` symbols. The check would be performed when instantiating an object and in particular before writing, prompting the user for a name change.

What are you opinion on this, in particular on restricting the names?

Please see the code and traceback in the attached SpatialData issue, as I can't reproduce on my machine: scverse/spatialdata-io#129.

Versions

-----
anndata             0.10.4
session_info        1.0.0
-----
asciitree           NA
cloudpickle         3.0.0
cython_runtime      NA
cytoolz             0.12.2
dask                2023.12.1
dateutil            2.8.2
dill                0.3.7
exceptiongroup      1.2.0
fasteners           0.17.3
gmpy2               2.1.2
h5py                3.9.0
importlib_metadata  NA
jinja2              3.1.2
markupsafe          2.1.3
mpmath              1.3.0
msgpack             1.0.7
natsort             8.4.0
numcodecs           0.12.1
numpy               1.26.3
packaging           23.2
pandas              2.1.4
psutil              5.9.7
pyarrow             14.0.1
pytz                2023.3.post1
scipy               1.11.4
six                 1.16.0
sphinxcontrib       NA
sympy               1.12
tblib               3.0.0
tlz                 0.12.2
toolz               0.12.0
torch               2.1.2
torchgen            NA
tqdm                4.66.1
typing_extensions   NA
xxhash              NA
yaml                6.0.1
zarr                2.16.1
zipp                NA
zoneinfo            NA
-----
Python 3.10.13 | packaged by conda-forge | (main, Oct 26 2023, 18:09:17) [Clang 16.0.6 ]
macOS-13.4.1-arm64-arm-64bit
-----
Session information updated at 2024-03-27 16:54

The text was updated successfully, but these errors were encountered:

ivirshup · 2024-03-27T19:20:44Z

@melonora, can you write a column with a / in it?

When / is part of a name of a var column, the data is written to disk in a subfolder (also in macOS, see screenshot) and can still be read correctly.

Basically this just happens to work with how we do columns. The specific group creation behavior could very well be considered a bug.

In spatialdata, we are considering checking all the element names and their respective element columns (e.g. GeoDataFrame column names, AnnData obs/var/... column names, etc) and allowing only strings with alphanumeric or the '-_.` symbols.

I would argue against this. It hurts accessibility for non-English users. Examples I've seen include Chinese characters used in columns for medical data. I also think it's not terribly uncommon for english speakers to want to use greek letters in names for things ( αβ/ γδ T cells, for instance).

I recall topics like this being discussed at length in some zarr community calls/ channels. I'd recommend checking over there for any recommendations/ solutions.

melonora · 2024-03-27T21:15:54Z

I did, the zarr store with current version of spatialdata and anndata version 0.10.3 creates an _index zarr array in the var zgroup. This is different from what we observe with the steinbock data. In the case when manually adjusting var names to include a / in the name did not lead to any problems.

melonora · 2024-03-27T21:16:59Z

@LucaMarconato is the uploaded steinbock date perhaps outdated?

LucaMarconato · 2024-03-27T22:12:26Z

I would argue against this. It hurts accessibility for non-English users. Examples I've seen include Chinese characters used in columns for medical data. I also think it's not terribly uncommon for english speakers to want to use greek letters in names for things ( αβ/ γδ T cells, for instance).

I agree with you, I would be up for allowing all the characters for this reason. The only reason why I wanted to restrict the name is that I would like to minimize the risk of weird behaviors if the name of the element is interpreted as a path. Things like /, ../, C:\\path\ etc. Maybe a solution is to disallow certain characters, like the ones that the OS prevents you to use in filenames.

LucaMarconato · 2024-03-27T22:13:42Z

@melonora, I have just verified. The Steinbock example is up-to-date and it's using the latest anndata 0.10.6.

Did I understand correctly: you can reproduce the bug only when the anndata io is handled by spatialdata, but not when using anndata directly? In that case it could be, maytbe, that somewhere we pass a Zarr group instead of a path, and maybe this causes the Zarr library to create the subpath when creating the Zarr group.

melonora · 2024-03-27T22:47:44Z

This is the anndata io by spatialdata. With the merfish example when I adjust the var names I get a var zarr group with this:

Reading it back in and checking table.var in the sdata object gives me:

So in short when adjusting the table in the merfish example I am able to write to zarr and then read back in.

LucaMarconato · 2024-03-27T22:59:14Z

Thanks for clarifying. But really not sure what's going on here then.

melonora · 2024-03-27T23:01:26Z

Yeah the behaviour is really different. The encoding_type of var in this case is a dataframe. This is specified in .zattrs in var.

ivirshup · 2024-03-28T12:18:35Z

I'm a little confused here. I thought the issue was with column names, not row names?

@melonora, if you do:

from anndata.experimental import read_elem, write_elem

import pandas as pd
import zarr

g = zarr.open("test_df.zarr", "w+")
df = pd.DataFrame({"col / with/ slashes": [1,2,3]})
write_elem(g, "df", df)
from_disk = read_elem(g["df"])

from_disk

what do you get?

melonora · 2024-03-28T19:59:02Z

This gives an error because of the / being interpreted as a path separator. The thing that I find weird is that the issue arises with the Steinbock dataset where we have this for table.var. When I try to replicate this issue by simply adjusting one of the indices in var to include / I don't seem to have the problem while for the Steinbock dataset I do.

melonora · 2024-03-28T20:01:48Z

To be specific this raises a FileNotFoundError as a result of a path with forward slashes not being found:

github-actions · 2024-05-28T02:18:01Z

This issue has been automatically marked as stale because it has not had recent activity.
Please add a comment if you want to keep the issue open. Thank you for your contributions!

LucaMarconato added the Bug 🐛 label Mar 27, 2024

ivirshup changed the title ~~/ in column names makes AnnData Zarr object unreadable~~ / in column names makes AnnData Zarr object unreadable on windows Mar 27, 2024

github-actions bot added the stale label May 28, 2024

LucaMarconato mentioned this issue Jul 9, 2024

Naming constraints break compatibility with existing datasets scverse/spatialdata#624

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`/` in column names makes AnnData Zarr object unreadable on windows #1447

`/` in column names makes AnnData Zarr object unreadable on windows #1447

LucaMarconato commented Mar 27, 2024

ivirshup commented Mar 27, 2024

melonora commented Mar 27, 2024

melonora commented Mar 27, 2024 •

edited

Loading

LucaMarconato commented Mar 27, 2024

LucaMarconato commented Mar 27, 2024 •

edited

Loading

melonora commented Mar 27, 2024 •

edited

Loading

LucaMarconato commented Mar 27, 2024 •

edited

Loading

melonora commented Mar 27, 2024

ivirshup commented Mar 28, 2024

melonora commented Mar 28, 2024

melonora commented Mar 28, 2024

github-actions bot commented May 28, 2024

/ in column names makes AnnData Zarr object unreadable on windows #1447

/ in column names makes AnnData Zarr object unreadable on windows #1447

Comments

LucaMarconato commented Mar 27, 2024

Please make sure these conditions are met

Report

Versions

ivirshup commented Mar 27, 2024

melonora commented Mar 27, 2024

melonora commented Mar 27, 2024 • edited Loading

LucaMarconato commented Mar 27, 2024

LucaMarconato commented Mar 27, 2024 • edited Loading

melonora commented Mar 27, 2024 • edited Loading

LucaMarconato commented Mar 27, 2024 • edited Loading

melonora commented Mar 27, 2024

ivirshup commented Mar 28, 2024

melonora commented Mar 28, 2024

melonora commented Mar 28, 2024

github-actions bot commented May 28, 2024

`/` in column names makes AnnData Zarr object unreadable on windows #1447

`/` in column names makes AnnData Zarr object unreadable on windows #1447

melonora commented Mar 27, 2024 •

edited

Loading

LucaMarconato commented Mar 27, 2024 •

edited

Loading

melonora commented Mar 27, 2024 •

edited

Loading

LucaMarconato commented Mar 27, 2024 •

edited

Loading