Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

/ in column names makes AnnData Zarr object unreadable on windows #1447

Open
2 of 3 tasks
LucaMarconato opened this issue Mar 27, 2024 · 12 comments
Open
2 of 3 tasks

Comments

@LucaMarconato
Copy link
Member

Please make sure these conditions are met

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of anndata.
  • (optional) I have confirmed this bug exists on the master branch of anndata.

Report

Reported by a spatialdata Windows user and reproduced on a Windows machine scverse/spatialdata-io#129. I can't reproduce on my macOS machine or on a Linux machine.

When / is part of a name of a var column, the data is written to disk in a subfolder (also in macOS, see screenshot) and can still be read correctly. In Windows the column can't be read, probably because of the difference between / and \ for paths.

In spatialdata, we are considering checking all the element names and their respective element columns (e.g. GeoDataFrame column names, AnnData obs/var/... column names, etc) and allowing only strings with alphanumeric or the '-_.` symbols. The check would be performed when instantiating an object and in particular before writing, prompting the user for a name change.

What are you opinion on this, in particular on restricting the names?

Please see the code and traceback in the attached SpatialData issue, as I can't reproduce on my machine: scverse/spatialdata-io#129.

Versions

-----
anndata             0.10.4
session_info        1.0.0
-----
asciitree           NA
cloudpickle         3.0.0
cython_runtime      NA
cytoolz             0.12.2
dask                2023.12.1
dateutil            2.8.2
dill                0.3.7
exceptiongroup      1.2.0
fasteners           0.17.3
gmpy2               2.1.2
h5py                3.9.0
importlib_metadata  NA
jinja2              3.1.2
markupsafe          2.1.3
mpmath              1.3.0
msgpack             1.0.7
natsort             8.4.0
numcodecs           0.12.1
numpy               1.26.3
packaging           23.2
pandas              2.1.4
psutil              5.9.7
pyarrow             14.0.1
pytz                2023.3.post1
scipy               1.11.4
six                 1.16.0
sphinxcontrib       NA
sympy               1.12
tblib               3.0.0
tlz                 0.12.2
toolz               0.12.0
torch               2.1.2
torchgen            NA
tqdm                4.66.1
typing_extensions   NA
xxhash              NA
yaml                6.0.1
zarr                2.16.1
zipp                NA
zoneinfo            NA
-----
Python 3.10.13 | packaged by conda-forge | (main, Oct 26 2023, 18:09:17) [Clang 16.0.6 ]
macOS-13.4.1-arm64-arm-64bit
-----
Session information updated at 2024-03-27 16:54
@ivirshup ivirshup changed the title / in column names makes AnnData Zarr object unreadable / in column names makes AnnData Zarr object unreadable on windows Mar 27, 2024
@ivirshup
Copy link
Member

@melonora, can you write a column with a / in it?


When / is part of a name of a var column, the data is written to disk in a subfolder (also in macOS, see screenshot) and can still be read correctly.

Basically this just happens to work with how we do columns. The specific group creation behavior could very well be considered a bug.

In spatialdata, we are considering checking all the element names and their respective element columns (e.g. GeoDataFrame column names, AnnData obs/var/... column names, etc) and allowing only strings with alphanumeric or the '-_.` symbols.

I would argue against this. It hurts accessibility for non-English users. Examples I've seen include Chinese characters used in columns for medical data. I also think it's not terribly uncommon for english speakers to want to use greek letters in names for things ( αβ/ γδ T cells, for instance).


I recall topics like this being discussed at length in some zarr community calls/ channels. I'd recommend checking over there for any recommendations/ solutions.

@melonora
Copy link

I did, the zarr store with current version of spatialdata and anndata version 0.10.3 creates an _index zarr array in the var zgroup. This is different from what we observe with the steinbock data. In the case when manually adjusting var names to include a / in the name did not lead to any problems.

@melonora
Copy link

melonora commented Mar 27, 2024

@LucaMarconato is the uploaded steinbock date perhaps outdated?

@LucaMarconato
Copy link
Member Author

I would argue against this. It hurts accessibility for non-English users. Examples I've seen include Chinese characters used in columns for medical data. I also think it's not terribly uncommon for english speakers to want to use greek letters in names for things ( αβ/ γδ T cells, for instance).

I agree with you, I would be up for allowing all the characters for this reason. The only reason why I wanted to restrict the name is that I would like to minimize the risk of weird behaviors if the name of the element is interpreted as a path. Things like /, ../, C:\\path\ etc. Maybe a solution is to disallow certain characters, like the ones that the OS prevents you to use in filenames.

@LucaMarconato
Copy link
Member Author

LucaMarconato commented Mar 27, 2024

@melonora, I have just verified. The Steinbock example is up-to-date and it's using the latest anndata 0.10.6.

Did I understand correctly: you can reproduce the bug only when the anndata io is handled by spatialdata, but not when using anndata directly? In that case it could be, maytbe, that somewhere we pass a Zarr group instead of a path, and maybe this causes the Zarr library to create the subpath when creating the Zarr group.

@melonora
Copy link

melonora commented Mar 27, 2024

This is the anndata io by spatialdata. With the merfish example when I adjust the var names I get a var zarr group with this:
afbeelding
Reading it back in and checking table.var in the sdata object gives me:
afbeelding

So in short when adjusting the table in the merfish example I am able to write to zarr and then read back in.

@LucaMarconato
Copy link
Member Author

LucaMarconato commented Mar 27, 2024

Thanks for clarifying. But really not sure what's going on here then.

@melonora
Copy link

Yeah the behaviour is really different. The encoding_type of var in this case is a dataframe. This is specified in .zattrs in var.

@ivirshup
Copy link
Member

I'm a little confused here. I thought the issue was with column names, not row names?

@melonora, if you do:

from anndata.experimental import read_elem, write_elem

import pandas as pd
import zarr

g = zarr.open("test_df.zarr", "w+")
df = pd.DataFrame({"col / with/ slashes": [1,2,3]})
write_elem(g, "df", df)
from_disk = read_elem(g["df"])

from_disk

what do you get?

@melonora
Copy link

This gives an error because of the / being interpreted as a path separator. The thing that I find weird is that the issue arises with the Steinbock dataset where we have this for table.var. When I try to replicate this issue by simply adjusting one of the indices in var to include / I don't seem to have the problem while for the Steinbock dataset I do.

@melonora
Copy link

To be specific this raises a FileNotFoundError as a result of a path with forward slashes not being found:
afbeelding

Copy link

This issue has been automatically marked as stale because it has not had recent activity.
Please add a comment if you want to keep the issue open. Thank you for your contributions!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants