Speed up writeH5AD #129

Open
stemangiola opened this issue Nov 13, 2024 · 5 comments

stemangiola commented Nov 13, 2024

I handle a lot of data, HCA-scale, and I would like a way to speed up the saving of large SCE objects. This could be achieved through low-level optimisation where possible, parallelisation, or tuning of the block size (as for HDF5Array).

Thanks!
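
For reference, the block size mentioned above is the global {DelayedArray} setting that HDF5Array-style block processing respects. A minimal sketch of tuning it around a write, assuming `sce` is a SingleCellExperiment; whether `writeH5AD()` honours it on every code path is exactly what this issue is asking about:

```r
library(DelayedArray)

getAutoBlockSize()      # default block size in bytes (1e8 = 100 MB)
setAutoBlockSize(5e8)   # try larger blocks: fewer, bigger chunks per write
zellkonverter::writeH5AD(sce, "large_sce.h5ad")
setAutoBlockSize()      # restore the default
```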

lazappi added the enhancement label Nov 14, 2024
lazappi commented Nov 14, 2024

Hi @stemangiola

Despite what some AI bot thinks, I'm not sure how easy this would be to change. {zellkonverter} does most of the object writing by passing things to Python and getting the Python anndata package to do it. The only exception is DelayedArray matrices, which are written manually as a workaround from @LTLA for an issue (which I think is now fixed). If that is what you are looking for, we might be able to look into it, but I'm not super familiar with how that works.
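
As a hedged illustration (not {zellkonverter} internals): you can check whether you would hit that manual path by testing the assay classes, and optionally realise a delayed assay to an in-memory sparse matrix before writing, at the cost of memory. `sce` is an assumed SingleCellExperiment:

```r
library(SingleCellExperiment)

# Which assays are DelayedArray objects (the manually-written case)?
vapply(assayNames(sce), function(a) {
  is(assay(sce, a, withDimnames = FALSE), "DelayedArray")
}, logical(1))

# One possible workaround: realise a delayed assay as an in-memory sparse
# matrix before writeH5AD(), assuming the backend supports sparse coercion.
assay(sce, "counts") <- as(assay(sce, "counts"), "CsparseMatrix")
```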

As an alternative, I'm also involved with {anndataR}, where we are trying to make a native R implementation of anndata. This kind of thing would be great to have there if you want to contribute something.

LTLA commented Nov 14, 2024

I've long forgotten what I did, but I doubt parallelization is going to help much here: HDF5 writes are single-threaded due to its SWMR (single-writer/multiple-reader) model. If your input assays are DelayedArray objects, then parallelization could help with realizing the delayed operations into memory, but this comes at the cost of increased memory usage, and it would eventually be bottlenecked by the write to disk anyway.
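
A hedged sketch of parallelising only that realization step, via the standard {DelayedArray}/{BiocParallel} knobs; whether the blocks actually run in parallel depends on the delayed operations involved, and the disk write itself stays serial:

```r
library(SummarizedExperiment)
library(DelayedArray)
library(BiocParallel)

setAutoBPPARAM(MulticoreParam(workers = 4))  # let block processing use 4 workers
mat <- realize(assay(sce, "counts"))         # realise delayed ops into memory (dense by default)
setAutoBPPARAM(SerialParam())                # restore the serial default
```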

If you need fast writes, you'd be better off with something like TileDB. But even so... for around a million cells, the analysis is going to take at least an hour anyway, so an extra 5-10 minutes to save to disk doesn't seem too bad.


stemangiola commented Nov 14, 2024

Thanks @LTLA. For TileDB, do you mean saveTileDBSummarizedExperiment?

Is TileDB supported well for both R and Python?

@lazappi, {anndataR} seems interesting!


LTLA commented Nov 14, 2024

I'm not aware of any saveTileDBSummarizedExperiment function. If one doesn't exist, you could consider writing one in TileDBArray, analogous to how saveHDF5SummarizedExperiment lives in HDF5Array.

TileDB has official support for both R and Python. I've mostly used the R client and I've never had any major problems.
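
A hedged sketch of what such a function might look like, assuming a SummarizedExperiment `se`. The function itself and the `.rds` shell are hypothetical; `TileDBArray::writeTileDBArray()` is existing API, but the `path=` pass-through to the TileDB sink should be verified:

```r
library(SummarizedExperiment)
library(TileDBArray)

# Hypothetical helper, sketched along the lines of
# HDF5Array::saveHDF5SummarizedExperiment().
saveTileDBSummarizedExperiment <- function(se, dir) {
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  for (a in assayNames(se)) {
    # Replace each assay with a TileDB-backed version written under `dir`.
    assay(se, a, withDimnames = FALSE) <- writeTileDBArray(
      assay(se, a, withDimnames = FALSE),
      path = file.path(dir, a)
    )
  }
  # Keep the lightweight shell (row/colData plus TileDB-backed assays).
  saveRDS(se, file.path(dir, "se.rds"))
  invisible(se)
}
```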
