Subsetting the atlas #1
The file contains something like 3.3 million cells by 58,000 genes, i.e. about 2*10^11 datapoints. Each datapoint is uint16, i.e. 2 bytes, for a grand total of about 400 GB of data. Maybe scanpy converts to float32 or even float64, which will double or quadruple the size. Even if stored as a sparse matrix, it will be very large. You can get a sparse matrix by doing:

x = ds.sparse()

To subset other data, just do the same as you did for the matrix:

labels = ds.ca.Clusters[ds.ca.ROIGroupCoarse == "Hypothalamus"]
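As a concrete sketch of that pattern (the filename "atlas.loom" is a placeholder for the downloaded file):

import loompy

# "atlas.loom" is a placeholder path; substitute the downloaded atlas file
with loompy.connect("atlas.loom") as ds:
    x = ds.sparse()                                  # whole matrix as a scipy.sparse COO matrix
    cells = ds.ca.ROIGroupCoarse == "Hypothalamus"   # boolean mask over columns (cells)
    labels = ds.ca.Clusters[cells]                   # column attributes subset with the same mask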
Hi, thanks again!
Docs are here: http://loompy.org

If you want to load it into memory as a dense matrix, you will need ~400 GB of RAM, because that's how large the matrix is. You can load it as a sparse matrix according to the code above, which should be much smaller but still large (I guess ~40 GB?). The best way to work with loom files this large is probably to not try to load them all into memory; the docs contain many examples of how to do that.

For example, to load the expression vector for ACTB across all cells:

ds[ds.ra.Gene == "ACTB", :]

To load the expression vector for a specific cell:

ds[:, ds.ca.CellID == "AAACATACATTCTC-1"]

To get the expression matrix for all cells in cluster 132:

ds[:, ds.ca.Clusters == 132]
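If the goal is to get a manageable subset into scanpy, here is a hedged sketch (assuming the anndata package; "atlas.loom" and "cluster132.h5ad" are placeholder filenames; the attribute names are the ones used above):

import loompy
import anndata as ad

with loompy.connect("atlas.loom") as ds:
    cells = ds.ca.Clusters == 132
    X = ds[:, cells].T                   # transpose to cells x genes, as AnnData expects
    adata = ad.AnnData(
        X,
        obs={"CellID": ds.ca.CellID[cells], "Clusters": ds.ca.Clusters[cells]},
        var={"Gene": ds.ra.Gene},
    )
adata.write_h5ad("cluster132.h5ad")      # placeholder output filename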
(You can filter all the metadata in the same way.)

To fit a PCA to the full dataset without loading it all into RAM:

from sklearn.decomposition import IncrementalPCA
genes = (ds.ra.Selected == 1)
pca = IncrementalPCA(n_components=50)
for (ix, selection, view) in ds.scan(axis=1):
    pca.partial_fit(view[genes, :].transpose())
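Fitting only learns the components; to get the projected cells, a second scan can transform each batch and stack the results (a sketch, following the same pattern as above):

import numpy as np

# Second pass: project each batch of cells onto the fitted components
chunks = []
for (ix, selection, view) in ds.scan(axis=1):
    chunks.append(pca.transform(view[genes, :].transpose()))
X_pca = np.vstack(chunks)   # cells x 50 array of principal-component coordinates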
I wanted to subset the loom file to each individual tissue, and either save it as a scanpy h5ad or save it as a loom file to be read into scanpy. Do you have a way of doing that?
Check out the scan() method:

for (ix, selection, view) in ds.scan(items=(ds.ca.ROIGroupCoarse == "Hypothalamus"), axis=1):
    # 'view' now contains a batch of cells from the hypothalamus, with all the corresponding attributes
    # You can append each batch of cells (and any column attributes you want) to an HDF5 file or any other target
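For instance, here is a minimal sketch that appends each batch to a new loom file, using loompy.new() and add_columns() as documented at loompy.org (the output name "hypothalamus.loom" is a placeholder); scanpy can then open the result with sc.read_loom:

import loompy

cells = ds.ca.ROIGroupCoarse == "Hypothalamus"
with loompy.new("hypothalamus.loom") as dsout:   # placeholder output filename
    for (ix, selection, view) in ds.scan(items=cells, axis=1):
        # append this batch of cells, carrying over all row and column attributes
        dsout.add_columns(view.layers, col_attrs=view.ca, row_attrs=view.ra)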
I get the following error with the above code.
FYI, for anyone having the above issue, I was able to fix this by running:
Good day,
First of all, thank you for making this incredible resource, and for making it accessible already! I have downloaded the loom file and have tried to load and convert it into AnnData using sc.read_loom. However, I have not succeeded yet. Since the function loads the file into memory, I am running into memory issues (the last attempt was on an HPC node with 512 GB of RAM). I am not familiar with working with loom files. I have had a look at the documentation and am trying to subset the loom file to extract the gene x cell matrix and the associated metadata, to later work with it in Scanpy or Seurat, but I am not sure I am doing it correctly.
So far I am trying to get the matrix first (which at this moment is still running, and I do not know what the outcome will be).
But I am not sure how I can later subset only the metadata associated with this subset (clusters, subclusters, cell IDs, etc.). Could you please tell me what would be the best way to do it?
Thank you in advance!