Kernel crashing with dataset of size 38237 × 32285 #361
Comments
Hi, thanks for reporting the issue! Wow, 520 GB is definitely a lot. The numbers of cells and genes should be fine with respect to memory, but a high number of cell types can lead to high memory usage. How many cell types do you have? If it's an extreme number, I'd recommend a hierarchical approach (multiple selections on subsets of the data based on coarser cell type labels).
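The hierarchical approach mentioned above could look roughly like the following sketch. Note this is illustrative only and does not use the actual Spapros API: `select_genes` is a hypothetical placeholder for a real selector, and the labels and hierarchy are made up.

```python
import numpy as np

# Illustrative sketch of a hierarchical selection: instead of one selection
# across all fine cell types at once, group them under coarser labels and
# run one selection per coarse group, then merge the per-group gene sets.
rng = np.random.default_rng(0)
n_cells, n_genes = 1000, 50
X = rng.poisson(1.0, size=(n_cells, n_genes)).astype(np.float32)

# Hypothetical coarse -> fine label hierarchy
hierarchy = {
    "immune": ["T cell", "B cell", "NK cell"],
    "stromal": ["fibroblast", "endothelial"],
}
fine_labels = rng.choice(
    [ft for fts in hierarchy.values() for ft in fts], size=n_cells
)

def select_genes(X_subset, n=5):
    """Placeholder for a real selector: picks the most variable genes."""
    order = np.argsort(X_subset.var(axis=0))[::-1]
    return set(order[:n].tolist())

selected = set()
for coarse, fine_types in hierarchy.items():
    mask = np.isin(fine_labels, fine_types)   # cells of this coarse group
    selected |= select_genes(X[mask])         # one selection per group

print(sorted(selected))
```

Each per-group run only sees a subset of cells and a handful of cell types, which keeps peak memory much lower than a single run over all fine types.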
Hi Louis
Ah but 21 cell types is really not too many. There shouldn't be an issue.
Dear Louis,
I understand.
Dear Louis! When running the pipeline, I first get this warning message, which I assume is related to the format of the input data:
At some point the following error messages appear:
When I check the folders generated for the Spapros output, they are empty except for the folders "selections" and "trees". "selections" contains "DE_selected.csv" and "pca_selected.csv"; "trees" contains "DE_prior_forest_results.pkl". The main folder also contains "time_measurements.csv". If it helps, I can also privately share the dataset and the script that I used for the analysis.
Interesting. You're running this on a shared machine on which you allocated 10 cores, right? Could you try running it once setting [...]? Would it also be possible to allocate a whole machine for yourself (in case I'm assuming correctly that it's a shared one)? I also always run it in a similar setup, so I'd be surprised if that's the issue, but I've seen similar errors in another context. Yes, it would be great if you can send me the test data and your script. I'll send you a PM.
Hey! Unfortunately, I cannot use a whole machine just for myself, but I can try again with 8 instead of 10 cores. How can I best share the data? We can also communicate via email if that helps.
Hi! I have similar issues with a dataset of 29398 × 26355 size and 18 clusters.
Hi @palfalvi - thanks a lot for reporting, including the details! I'll try to have a look tomorrow
Sorry for not coming back to this earlier. Anyway, I think what happens is the following: if [...].

@palfalvi, in case [...], run

```python
adata.X = adata.X.astype(np.float32)  # assuming adata.X is sparse
```

before you provide [...].

@HeleneHemmer, the line above should solve the memory leak for running the selection on your data. For your data the selection is unfortunately quite slow, which still seems a bit weird to me. I'm investigating a solution for that as well right now; when finished I'll update the package to a new version.
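The suggested cast to `float32` halves the memory footprint of the matrix, since each value drops from 8 to 4 bytes. A minimal demo (a small dense NumPy array stands in for the sparse `adata.X`; the same `.astype` call works on scipy sparse matrices):

```python
import numpy as np

# A float64 matrix uses 8 bytes per stored value; float32 uses 4.
X64 = np.ones((1000, 500), dtype=np.float64)
X32 = X64.astype(np.float32)

print(X64.nbytes)  # 4000000 bytes
print(X32.nbytes)  # 2000000 bytes, half the footprint
```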
Kernel crashing with large dataset
Dear Spapros-team,
thank you for developing the probe selection pipeline!
I can successfully run the pipeline on a dataset of size 8505 × 31053. Now I would like to move to a larger dataset containing 38237 cells and 32285 genes. Unfortunately, every time I run `selector.select_probeset()`, the kernel crashes after 10-15 minutes due to an out-of-memory error.
Currently, I am using 520 GB of memory and 10 cores on a Linux machine running Fedora 40. My mamba environment uses Python v3.10.14, and I installed Spapros v0.1.5 via pip.
Any help would be much appreciated!
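For context, a back-of-envelope estimate of what a single dense copy of the reported matrix would occupy (illustrative arithmetic only; the actual crash at 520 GB suggests many such copies or intermediates, e.g. per-worker duplicates during parallel steps):

```python
# One dense float64 copy of a 38237 x 32285 matrix.
n_cells, n_genes = 38237, 32285
bytes_f64 = n_cells * n_genes * 8   # 8 bytes per float64 value
gib = bytes_f64 / 2**30
print(round(gib, 1))                # ~9.2 GiB per dense copy
```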