
Writing dict in uns with many keys is slow #1684

Open · grst opened this issue Sep 21, 2024 · 2 comments

@grst (Contributor) commented Sep 21, 2024

Please make sure these conditions are met

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of anndata.
  • (optional) I have confirmed this bug exists on the master branch of anndata.

Report

Code:

import anndata
import numpy as np

adata = anndata.AnnData()
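# 20,000 entries, each value a 0-d object array holding one string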
adata.uns["x"] = {str(i): np.array(str(i), dtype="object") for i in range(20000)}

# %%time
adata.write_h5ad("/tmp/anndata.h5ad")

# %%time
anndata.read_h5ad("/tmp/anndata.h5ad")

On my machine, this takes 7s to write and 4s to load for a dictionary with only 20k elements.
How hard would it be to make this (significantly) faster?

Additional context

In scirpy, I use dicts of arrays (each array holding indices that refer to the $n$ cells) to store clonotype clusters. The dictionary is not (necessarily) aligned to one of the axes, so it lives in uns. Now that we have sped up the clonotype clustering steps themselves, saving the object has become a major bottleneck, as this dict can have several hundred thousand keys.

We could possibly change the dictionary to something more efficient, but that would mean breaking our data format. Therefore I first wanted to check if it can be made faster on the anndata side.
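
For illustration, a minimal sketch of what a more efficient layout could look like (a hypothetical format, not scirpy's actual one): storing the keys and values as two aligned arrays means anndata creates two large datasets instead of one tiny dataset per key.

import anndata
import numpy as np

n = 20000
adata = anndata.AnnData()
# one array of keys and one of values, aligned by position,
# instead of a dict with one 0-d array per key
adata.uns["x_keys"] = np.array([str(i) for i in range(n)], dtype=object)
adata.uns["x_values"] = np.array([str(i) for i in range(n)], dtype=object)
adata.write_h5ad("/tmp/anndata_flat.h5ad")  # only two datasets to create

Lookup by key then becomes a search over x_keys, so whether the trade-off is acceptable depends on the access pattern.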

CC @felixpetschko

Versions

-----
anndata             0.9.2
numpy               1.24.4
session_info        1.0.0
-----
asciitree           NA
asttokens           NA
awkward             2.6.4
awkward_cpp         NA
backcall            0.2.0
cloudpickle         2.2.1
comm                0.1.4
cython_runtime      NA
dask                2023.8.1
dateutil            2.8.2
debugpy             1.6.8
decorator           5.1.1
entrypoints         0.4
executing           1.2.0
fasteners           0.18
fsspec              2023.6.0
h5py                3.9.0
importlib_metadata  NA
ipykernel           6.25.0
jedi                0.19.0
...
Python 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) [GCC 12.3.0]
Linux-6.10.5-arch1-1-x86_64-with-glibc2.40
-----
Session information updated at 2024-09-21 14:49
@ilan-gold (Contributor) commented
Hmmm @grst, I would suspect the issue is that we recursively write each key's value as its native data type, which means you end up creating thousands of zarr/hdf5 arrays. I'm not really sure we can do much about that at the moment. But with the coming zarr v3 we might in theory be able to do this in parallel, which would be a big boost. So I think we should wait for that: #1726 will be a first step toward just getting things working.

I'm not sure the async/parallel zarr stuff works with v2, but I think it does.
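
To illustrate the point, a pure-h5py sketch (my own illustration, not anndata's writer): creating one dataset per key dominates the cost, while the same strings written as a single dataset should be orders of magnitude faster.

import time

import h5py
import numpy as np

strings = [str(i) for i in range(20000)]

# one scalar string dataset per key, roughly what the recursive writer produces
t0 = time.time()
with h5py.File("/tmp/many.h5", "w") as f:
    for s in strings:
        f[s] = s  # each assignment creates a separate HDF5 dataset
print(f"20,000 datasets: {time.time() - t0:.2f}s")

# the same strings as one variable-length string dataset
t0 = time.time()
with h5py.File("/tmp/one.h5", "w") as f:
    f.create_dataset("x", data=np.array(strings, dtype=object), dtype=h5py.string_dtype())
print(f"1 dataset: {time.time() - t0:.2f}s")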

@grst (Contributor, Author) commented Nov 5, 2024

Thanks for your response! I think we'll just adapt our data format to be more efficient in that case.
Feel free to close.

ilan-gold removed the Bug 🐛 label Nov 6, 2024
grst removed this from scirpy-dev Nov 24, 2024