
Kernel crashing with dataset of size 38237 × 32285 #361

Open · HeleneHemmer opened this issue Jul 30, 2024 · 11 comments
Labels: bug (Something isn't working)

@HeleneHemmer

Kernel crashing with large dataset

Dear Spapros-team,

thank you for developing the probe selection pipeline!

I am successfully able to run the pipeline on a dataset of size 8505 × 31053. Now I would like to move to a larger dataset containing 38237 cells and 32285 genes. Unfortunately, every time I run the function "selector.select_probeset()", the kernel crashes after 10-15 minutes due to an out-of-memory error.
Currently, I am using 520 GB of memory and 10 cores on a Linux machine running Fedora 40. My mamba environment uses Python v3.10.14, and I installed Spapros v0.1.5 via pip.

Any help would be much appreciated!
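
For context, the call in question looks roughly like this (a minimal sketch; the file name and the celltype_key/n arguments are placeholder assumptions, while the constructor path, n_jobs, and select_probeset() appear elsewhere in this thread):

    import scanpy as sc
    import spapros as sp

    adata = sc.read_h5ad("dataset.h5ad")  # placeholder path to the 38237 x 32285 dataset
    selector = sp.se.ProbesetSelector(
        adata,
        celltype_key="celltype",  # assumed name of the cell type column in adata.obs
        n=150,                    # assumed probe set size
        n_jobs=10,
    )
    selector.select_probeset()  # the kernel crashes 10-15 minutes into this step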

HeleneHemmer added the "bug" label on Jul 30, 2024
@LouisK92
Collaborator

Hi, thanks for reporting the issue!

Wow, 520 GB is definitely a lot. The number of cells and genes should be fine wrt memory; however, a high number of cell types can lead to high memory usage. How many cell types do you have? If it's an extreme number, I'd recommend a hierarchical approach (multiple selections on subsets of the data based on coarser cell type labels).
Do you preselect highly variable genes? That could reduce memory usage as well, and might provide a quick solution for this use case...
In case you had some output info printed about the computation step where it fails: was it during the tree training?
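
For illustration, the hierarchical approach could look roughly as follows (a sketch only; adata is assumed to be loaded, and the "coarse_celltype" and "celltype" column names are assumptions, not part of the spapros API):

    import spapros as sp

    # Run one selection per coarse group instead of a single selection over all cell types.
    selectors = {}
    for group in adata.obs["coarse_celltype"].unique():  # assumed coarse label column
        adata_sub = adata[adata.obs["coarse_celltype"] == group].copy()
        selectors[group] = sp.se.ProbesetSelector(
            adata_sub,
            celltype_key="celltype",  # assumed fine-grained label column
            n_jobs=10,
        )
        selectors[group].select_probeset()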

@HeleneHemmer
Author

Hi Louis,
thanks for responding so quickly! Yes, I pre-selected on HVGs (8000, as described in your publication).
I think subsampling the dataset is the way to go. There are 21 cell types in the dataset that I am using now, and compared to your sample data, that really is quite a lot.
Unfortunately, no really meaningful output info was printed, but the kernel crash happens during this step: "Train prior forest for DE_baseline forest".
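
For reference, that preselection step could be done roughly like this (a minimal scanpy sketch; it assumes adata holds log-normalized expression, as expected by the default flavor of highly_variable_genes):

    import scanpy as sc

    # Keep only the 8000 most highly variable genes before building the selector.
    sc.pp.highly_variable_genes(adata, n_top_genes=8000)
    adata = adata[:, adata.var["highly_variable"]].copy()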

@LouisK92
Collaborator

Ah, but 21 cell types is really not too many; there shouldn't be an issue.
Is there a chance that you could share the data with me (privately), so I can try to reproduce the issue? Otherwise it's a bit hard for me to debug from my side.

@HeleneHemmer
Author

Dear Louis,
unfortunately the data is unpublished material, so I am not allowed to share it. But I want to try with another similarly sized public dataset to check if it is a problem with the data itself.
Thank you anyway for helping me out! I'll let you know if I run into further problems.

@LouisK92
Collaborator

LouisK92 commented Aug 1, 2024

I understand.
Okay, I hope we'll figure out what the issue is.
It would be great if you could reproduce the error with another dataset, so I can have a deeper look.

@HeleneHemmer
Author

HeleneHemmer commented Aug 6, 2024

Dear Louis!
I ran the pipeline again, using the same amount of memory and cores, to analyze the public data from the liver cell atlas (mouse, steady state, all cells), which I merged with the data from Friedrich et al. in this study. The merged dataset contains 37 cell types, which I reduced to 2000 cells each, as described in your preprint for the heart data. The final dataset has a size of 48433 cells × 31053 genes.
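
A rough sketch of such a per-cell-type subsampling ("celltype" as the adata.obs column name is an assumption):

    import numpy as np

    # Downsample each cell type to at most 2000 cells.
    rng = np.random.default_rng(0)
    keep_cells = []
    for ct in adata.obs["celltype"].unique():
        cells = adata.obs_names[adata.obs["celltype"] == ct]
        keep_cells.extend(rng.choice(cells, size=min(2000, len(cells)), replace=False))
    adata = adata[keep_cells].copy()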

When running the pipeline, I first get this warning message, which I assume is related to the format of the input data:

/.conda/envs/Spapros/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:461: 
RuntimeWarning: invalid value encountered in log2
  self.stats[group_name, "logfoldchanges"] = np.log2(

At some point the following error messages appear:

.conda/envs/Spapros/lib/python3.10/site-packages/joblib/externals/loky/process_executor.py:752:
UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short
worker timeout or by a memory leak.
  warnings.warn(

When I check the folders generated for the Spapros output, they are empty, except for the folders "selections" and "trees". "selections" contains "DE_selected.csv" and "pca_selected.csv"; "trees" contains "DE_prior_forest_results.pkl". The main folder also contains "time_measurements.csv".

If it helps, I can also privately share the dataset and the script that I used for the analysis.

@LouisK92
Collaborator

LouisK92 commented Aug 7, 2024

Interesting.

You're running this on a shared machine, on which you allocated 10 cores, right?

Could you try running it once with spapros.se.ProbesetSelector(..., n_jobs=8)? (10 should also work, but just to be sure.)

Would it also be possible to allocate a whole machine for yourself? (In case I'm assuming correctly that it's a shared one.)

I also always run it in a similar setup, so I'd be surprised if that's the issue, but I've seen similar errors in another context.

Yes, it would be great if you can send me the test data and your script. I'll send you a PM.

@HeleneHemmer
Author

Hey! Unfortunately, I cannot use a whole machine just for myself, but I can try again with 8 instead of 10 cores. How can I best share the data? We can also communicate via email if that helps.

@palfalvi

Hi!

I have similar issues with a dataset of 29398 × 26355 size and 18 clusters.

selector.select_probeset() just gives an OSError: [Errno 12] Cannot allocate memory whenever n_jobs is more than 1.
I have >100 GB of memory (134 GB, minus 18 GB before running Spapros, minus 43.8 GB after creating the selector object) on a local Debian machine.
With n_jobs=1 it takes over 5 hours and I can barely see any extra memory usage. With n_jobs=2 it dies after less than a minute with a max memory usage of <50% of the available memory.
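
One way to track these numbers around each step is to log the available memory (a sketch, assuming psutil is installed and adata is already loaded; all ProbesetSelector arguments other than n_jobs are omitted as placeholders):

    import psutil
    import spapros as sp

    def log_available_gb(label):
        # Print the currently available system memory in GB.
        print(f"{label}: {psutil.virtual_memory().available / 1e9:.1f} GB available")

    log_available_gb("before creating the selector")
    selector = sp.se.ProbesetSelector(adata, n_jobs=1)  # placeholder call; other arguments omitted
    log_available_gb("after creating the selector")
    selector.select_probeset()
    log_available_gb("after select_probeset()")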

@LouisK92
Collaborator

Hi @palfalvi - thanks a lot for reporting, including the details!

I'll try to have a look tomorrow

@LouisK92
Collaborator

LouisK92 commented Dec 6, 2024

Sorry for not coming back to this earlier.
It seems a bit difficult to pin down the memory leak; at least my profiling techniques don't indicate the issue clearly. I suppose the multiprocessing interferes with the profiling. And also, with some of the test data that I use, I only run into the memory leak after longer computations, so waiting times also slow things down.

Anyway, I think what happens is the following: if adata.X does not have type float32, then copies of the data are generated during the tree training. When multiprocessing is active, these copies are probably not properly garbage collected.

@palfalvi in case your adata.X doesn't have type float32, could you try

    import numpy as np
    adata.X = adata.X.astype(np.float32)  # assuming adata.X is sparse

before you provide adata to the ProbesetSelector? Please let me know if that changes something in your case.

@HeleneHemmer the line above should solve the memory leak for running the selection on your data. For your data the selection is unfortunately quite slow, which still seems a bit weird to me. I'm investigating a solution for that as well right now.

When finished I'll update the package to a new version.
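
For completeness, a small variant of the fix above with an explicit dtype check before building the selector (a sketch; the check works for both dense and scipy-sparse adata.X):

    import numpy as np

    # Convert only if needed, so an already-float32 matrix is not copied again.
    if adata.X.dtype != np.float32:
        adata.X = adata.X.astype(np.float32)
    print(adata.X.dtype)  # should now report float32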
