Speed up writeH5AD #129

Open
stemangiola opened this issue Nov 13, 2024 · 5 comments

stemangiola commented Nov 13, 2024

I handle a lot of data, HCA-scale, and I would like a way to speed up the saving of large SCE objects. This could be achieved through low-level optimisation where possible, parallelisation, or tuning of the block size (as for HDF5Array).

Thanks!
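
For reference, the block size mentioned above is the global {DelayedArray} setting that HDF5Array-style block processing respects. A minimal sketch of tuning it around a write, assuming `sce` is a SingleCellExperiment; whether `writeH5AD()` honours it on every code path is exactly what this issue is asking about:

```r
library(DelayedArray)

getAutoBlockSize()      # default block size in bytes (1e8 = 100 MB)
setAutoBlockSize(5e8)   # try larger blocks: fewer, bigger chunks per write
zellkonverter::writeH5AD(sce, "large_sce.h5ad")
setAutoBlockSize()      # restore the default
```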

lazappi added the enhancement label Nov 14, 2024
lazappi commented Nov 14, 2024

Hi @stemangiola

Despite what some AI bot thinks, I'm not sure how easy this would be to change. {zellkonverter} does most of the object writing by passing things to Python and getting the Python anndata package to do it. The only exception is DelayedArray matrices, which are written manually as a workaround from @LTLA for an issue (which I think is now fixed). If that is what you are looking for, we might be able to look into it, but I'm not super familiar with how that works.
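
As a hedged illustration (not {zellkonverter} internals): you can check whether you would hit that manual path by testing the assay classes, and optionally realise a delayed assay to an in-memory sparse matrix before writing, at the cost of memory. `sce` is an assumed SingleCellExperiment:

```r
library(SingleCellExperiment)

# Which assays are DelayedArray objects (the manually-written case)?
vapply(assayNames(sce), function(a) {
  is(assay(sce, a, withDimnames = FALSE), "DelayedArray")
}, logical(1))

# One possible workaround: realise a delayed assay as an in-memory sparse
# matrix before writeH5AD(), assuming the backend supports sparse coercion.
assay(sce, "counts") <- as(assay(sce, "counts"), "CsparseMatrix")
```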

As an alternative, I'm also involved with {anndataR}, where we are trying to make a native R implementation of anndata. This kind of thing would be great to have there if you want to contribute something.

LTLA commented Nov 14, 2024

I've long forgotten what I did, but I doubt parallelization is going to help much here: HDF5 writes are single-threaded due to its SWMR (single-writer/multiple-reader) model. If your input assays are DelayedArray objects, then parallelization could help with realizing the delayed operations into memory, but this comes at the cost of increased memory usage, and it would eventually be bottlenecked by the write to disk anyway.
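
A hedged sketch of parallelising only that realization step, via the standard {DelayedArray}/{BiocParallel} knobs; whether the blocks actually run in parallel depends on the delayed operations involved, and the disk write itself stays serial:

```r
library(SummarizedExperiment)
library(DelayedArray)
library(BiocParallel)

setAutoBPPARAM(MulticoreParam(workers = 4))  # let block processing use 4 workers
mat <- realize(assay(sce, "counts"))         # realise delayed ops into memory (dense by default)
setAutoBPPARAM(SerialParam())                # restore the serial default
```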

If you need fast writes, you'd be better off with something like TileDB. But even so... for around a million cells, the analysis is going to take at least an hour anyway, so an extra 5-10 minutes to save to disk doesn't seem too bad.


stemangiola commented Nov 14, 2024

Thanks @LTLA. For TileDB, do you mean saveTileDBSummarizedExperiment?

Is TileDB supported well for both R and Python?

@lazappi, {anndataR} seems interesting!


LTLA commented Nov 14, 2024

I'm not aware of any saveTileDBSummarizedExperiment function. If one doesn't exist, you could consider writing one in TileDBArray, analogous to how saveHDF5SummarizedExperiment lives in HDF5Array.

TileDB has official support for both R and Python. I've mostly used the R client and I've never had any major problems.
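
A hedged sketch of what such a function might look like, assuming a SummarizedExperiment `se`. The function itself and the `.rds` shell are hypothetical; `TileDBArray::writeTileDBArray()` is existing API, but the `path=` pass-through to the TileDB sink should be verified:

```r
library(SummarizedExperiment)
library(TileDBArray)

# Hypothetical helper, sketched along the lines of
# HDF5Array::saveHDF5SummarizedExperiment().
saveTileDBSummarizedExperiment <- function(se, dir) {
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  for (a in assayNames(se)) {
    # Replace each assay with a TileDB-backed version written under `dir`.
    assay(se, a, withDimnames = FALSE) <- writeTileDBArray(
      assay(se, a, withDimnames = FALSE),
      path = file.path(dir, a)
    )
  }
  # Keep the lightweight shell (row/colData plus TileDB-backed assays).
  saveRDS(se, file.path(dir, "se.rds"))
  invisible(se)
}
```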
