Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Subsetting the atlas #1

Open
ccruizm opened this issue Oct 18, 2022 · 8 comments
Open

Subsetting the atlas #1

ccruizm opened this issue Oct 18, 2022 · 8 comments

Comments

@ccruizm
Copy link

ccruizm commented Oct 18, 2022

Good day,

First of all, thank you for making this incredible resource! And also for making it accessible already. I have downloaded the loom file and have tried to load and convert it into AnnData using sc.read_loom. However, I have not succeeded yet. Since the function will load the file into memory, I am running into memory issues (the last attempt was in an HPC with 512GB RAM).

I am not familiar with working with loom files. I have had a look at the documentation and trying to subset the loom file to extract the gene x cell matrix and the associated metadata to later work with it in Scanpy or Suerat, but I am not sure I am doing it correctly.

Thus far I am trying to get the matrix first (which at this moment is still running and do not know what the outcome is)

counts = ds[:, (ds.ca.ROIGroupCoarse == "Hypothalamus") | 
  (ds.ca.ROIGroupCoarse == "Amygdala")]

But I am not sure how I can subset later only the metadata associated with this subset (clusters, subclusters, cell IDs, etc). Could you please tell me what would be the best way to do it?

Thank you in advance!

@slinnarsson
Copy link
Contributor

The file contains something like 3.3 million cells by 58,000 genes, i.e. about 2*10^11 datapoints. Each datapoint is uint16, i.e. 2 bytes, for a grand total of about 400 GB of data. Maybe scanpy converts to float32 or even float64, which will double or quadruple the size.

Even if stored as a sparse matrix, it will be very large. You can get a sparse matrix by doing

x = ds[””].sparse()

To subset other data, just do the same as you did for the matrix:

labels = ds.ca.Clusters[(ds.ca.ROIGroupCoarse == "Hypothalamus") |
(ds.ca.ROIGroupCoarse == "Amygdala")]

@Archanayd
Copy link

Hi,
Thanks for the amazing resource and making the data available to the public. I have a similar query.
What is the best way to read+write 'adult_human_20221007.loom' since its so big in size?
In general, I would be happy to just get count matrices and associated metadata for major cell types such as Microglia, Oligodendrocytes, etc.
I am also not very familiar with working with loom files.

Thanks again
Archana

@slinnarsson
Copy link
Contributor

slinnarsson commented Oct 21, 2022

Docs are here: http://loompy.org

If you want to load into memory as a dense matrix, you will need ~400GB RAM because that's how large the matrix is. You can load it as a sparse matrix according to the code above, which should be much smaller but still large (I guess, 40 GB?).

The best way to work with loom files this large is probably to not try to load it all in memory. The docs contain many examples of how to do that. For example, to load the expression vector for ACTB across all cells:

ds[ds.ra.Gene == "ACTB", :]

To load the expression vector for a specific cell:

ds[:, ds.ca.CellID == "AAACATACATTCTC-1"]

To get the expression matrix for all cells in cluster 132:

x = ds[:, ds.ca.Clusters == 132]

(you can filter all the metadata in the same way)

To fit a PCA to the full dataset, without loading it all in RAM:

from sklearn.decomposition import IncrementalPCA
genes = (ds.ra.Selected == 1)
pca = IncrementalPCA(n_components=50)
for (ix, selection, view) in ds.scan(axis=1):
    pca.partial_fit(view[genes, :].transpose())

@rlittman16
Copy link

I wanted to subset the loom to each individual tissue. Either to save as a scanpy h5ad or save as a loom file to be read into scanpy. Do you have a way of doing that?

@slinnarsson
Copy link
Contributor

Check out the scan() method, which provides an automatic way of scanning through the loom file and all its attributes:

for (ix, selection, view) in ds.scan(items=(ds.ca.ROIGroupCoarse == "Hypothalamus"), axis=1):
   # 'view' now contains a batch of cells from Hypothalamus, with all the corresponding attributes
   # You can append each batch of cells (and any column attributes you want) to a HDF5 file or any other target

The scan() method returns a sequence of views, which are like a virtual mini-loom file containing just a batch of cells (out of the selected items), so that you can easily scan through arbitrarily large files without loading them fully into RAM.

@rlittman16
Copy link

I get the following error with the above code.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [10], in <cell line: 1>()
      2 print(ds.ca.keys())
      3 print(ds.shape)
----> 5 for (ix, selection, view) in ds.scan(items=(ds.ca.ROIGroupCoarse == "Hypothalamus"), axis=1):
      6     print(ix)
      7     print(selection)

File ~/project-rlittman/miniconda3/envs/scing/lib/python3.8/site-packages/loompy/loompy.py:629, in LoomConnection.scan(self, items, axis, layers, key, batch_size, what)
    627 if "col_attrs" in what:
    628 	if selection is not None:
--> 629 		ca = self.ca[ix + selection]
    630 	else:
    631 		ca = self.ca[ix: ix + cols_per_chunk]

File ~/project-rlittman/miniconda3/envs/scing/lib/python3.8/site-packages/loompy/attribute_manager.py:83, in AttributeManager.__getitem__(self, thing)
     81 	am = AttributeManager(None, axis=self.axis)
     82 	for key, val in self.items():
---> 83 		am[key] = val[thing]
     84 	return am
     85 elif type(thing) is tuple:
     86 	# A tuple of strings giving alternative names for attributes

TypeError: 'NoneType' object is not subscriptable

@Ararder
Copy link

Ararder commented Nov 17, 2022

FYI, for anyone having the above issue, I was able to fix this by running.
del ds.ca['Roi']
del ds.ca['Tissues']
Seems like these variables are missing/corrupted when reading the file with loompy. If removed, scan works as expected.

@Ararder
Copy link

Ararder commented Nov 18, 2022

  1. I'm not sure exactly what you are asking here. Maybe post the code and some info about the error? And I have no idea why are you deleting ROIGroupCoarse and ROIGroupFine. Those works perfectly for me.

  2. By reading the error message and going through the source code.

  3. I'm new to loom, so I cannot give you a good answer. These variables are there, but for some reason loompy, file reading is not working for those.
    With R, I was able to extract those variables just fine.
    h5f <- H5Fopen("adult_human_20221007.loom")
    roi <- h5f$col_attrs$Roi
    cellID <- h5f$col_attrs$CellID
    df = tibble(CellID = as.data.frame(cellID)[[1]], roi = as.data.frame(roi)[[1]])
    write_tsv(df, "adult_human_20221007_roi.tsv")

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants