Subsetting the atlas #1
The file contains something like 3.3 million cells by 58,000 genes, i.e. about 2*10^11 datapoints. Each datapoint is uint16, i.e. 2 bytes, for a grand total of about 400 GB of data. Maybe scanpy converts to float32 or even float64, which will double or quadruple the size. Even if stored as a sparse matrix, it will be very large. You can get a sparse matrix by doing:

x = ds.sparse()

To subset other data, just do the same as you did for the matrix:

labels = ds.ca.Clusters[ds.ca.ROIGroupCoarse == "Hypothalamus"]
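As a concrete sketch of that pattern (the filename "atlas.loom" is a placeholder for the downloaded file):

import loompy

# "atlas.loom" is a placeholder path; substitute the downloaded atlas file
with loompy.connect("atlas.loom") as ds:
    x = ds.sparse()                                  # whole matrix as a scipy.sparse COO matrix
    cells = ds.ca.ROIGroupCoarse == "Hypothalamus"   # boolean mask over columns (cells)
    labels = ds.ca.Clusters[cells]                   # column attributes subset with the same mask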
Hi, thanks again!
Docs are here: http://loompy.org

If you want to load it into memory as a dense matrix, you will need ~400 GB of RAM, because that's how large the matrix is. You can load it as a sparse matrix according to the code above, which should be much smaller but still large (I guess ~40 GB?). The best way to work with loom files this large is probably to not try to load them all into memory; the docs contain many examples of how to do that.

For example, to load the expression vector for ACTB across all cells:

ds[ds.ra.Gene == "ACTB", :]

To load the expression vector for a specific cell:

ds[:, ds.ca.CellID == "AAACATACATTCTC-1"]

To get the expression matrix for all cells in cluster 132:

ds[:, ds.ca.Clusters == 132]
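If the goal is to get a manageable subset into scanpy, here is a hedged sketch (assuming the anndata package; "atlas.loom" and "cluster132.h5ad" are placeholder filenames; the attribute names are the ones used above):

import loompy
import anndata as ad

with loompy.connect("atlas.loom") as ds:
    cells = ds.ca.Clusters == 132
    X = ds[:, cells].T                   # transpose to cells x genes, as AnnData expects
    adata = ad.AnnData(
        X,
        obs={"CellID": ds.ca.CellID[cells], "Clusters": ds.ca.Clusters[cells]},
        var={"Gene": ds.ra.Gene},
    )
adata.write_h5ad("cluster132.h5ad")      # placeholder output filename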
(You can filter all the metadata in the same way.)

To fit a PCA to the full dataset without loading it all into RAM:

from sklearn.decomposition import IncrementalPCA
genes = (ds.ra.Selected == 1)
pca = IncrementalPCA(n_components=50)
for (ix, selection, view) in ds.scan(axis=1):
    pca.partial_fit(view[genes, :].transpose())
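Fitting only learns the components; to get the projected cells, a second scan can transform each batch and stack the results (a sketch, following the same pattern as above):

import numpy as np

# Second pass: project each batch of cells onto the fitted components
chunks = []
for (ix, selection, view) in ds.scan(axis=1):
    chunks.append(pca.transform(view[genes, :].transpose()))
X_pca = np.vstack(chunks)   # cells x 50 array of principal-component coordinates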
I wanted to subset the loom file to each individual tissue, and either save it as a scanpy h5ad or save it as a loom file to be read into scanpy. Do you have a way of doing that?
Check out the scan() method:

for (ix, selection, view) in ds.scan(items=(ds.ca.ROIGroupCoarse == "Hypothalamus"), axis=1):
    # 'view' now contains a batch of cells from the hypothalamus, with all the corresponding attributes
    # You can append each batch of cells (and any column attributes you want) to an HDF5 file or any other target
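For instance, here is a minimal sketch that appends each batch to a new loom file, using loompy.new() and add_columns() as documented at loompy.org (the output name "hypothalamus.loom" is a placeholder); scanpy can then open the result with sc.read_loom:

import loompy

cells = ds.ca.ROIGroupCoarse == "Hypothalamus"
with loompy.new("hypothalamus.loom") as dsout:   # placeholder output filename
    for (ix, selection, view) in ds.scan(items=cells, axis=1):
        # append this batch of cells, carrying over all row and column attributes
        dsout.add_columns(view.layers, col_attrs=view.ca, row_attrs=view.ra)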
I get the following error with the above code.
FYI, for anyone having the above issue, I was able to fix this by running:
Good day,
First of all, thank you for making this incredible resource, and for making it accessible already! I have downloaded the loom file and have tried to load and convert it into AnnData using sc.read_loom. However, I have not succeeded yet. Since the function loads the file into memory, I am running into memory issues (the last attempt was on an HPC node with 512 GB of RAM). I am not familiar with working with loom files. I have had a look at the documentation and am trying to subset the loom file to extract the gene x cell matrix and the associated metadata, to later work with it in Scanpy or Seurat, but I am not sure I am doing it correctly.
So far I am trying to get the matrix first (which at this moment is still running, and I do not know what the outcome will be).
But I am not sure how I can later subset only the metadata associated with this subset (clusters, subclusters, cell IDs, etc.). Could you please tell me what would be the best way to do it?
Thank you in advance!