This repository contains scripts and tools for clustering sc/snRNA-seq data. Specifically, it has been used to cluster roughly 6M single-nucleus profiles generated in the Macosko lab.
The analysis makes heavy use of
- Our single-nuclei clustering method, written in Python and scalable to millions of cells (in this repository)
- SpaceXR, a method to map clustered cells onto Slide-seq pucks
- scOnline, an analytical framework for analysing millions of transcriptomic profiles
The sequenced libraries were aligned with cellranger count and then processed with CellBender remove-background.
We make heavy use of Dask and Zarr. This combination enables the largest computations to be performed in parallel chunks without ever loading the full dataset into memory. Analysis was performed on a very large (`n2d-highmem-64`) virtual machine on Google Cloud.
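As a minimal sketch of this pattern (the path, array layout, and statistic are hypothetical, not the repository's actual code), a per-cell quantity can be computed chunk by chunk without materializing the matrix:

```python
import dask.array as da

# Open the on-disk Zarr array lazily; nothing is read into memory yet.
counts = da.from_zarr("counts.zarr")  # hypothetical path

# Per-cell UMI totals, expressed as a lazy task graph over the chunks.
umis_per_cell = counts.sum(axis=1)

# Only now does computation run, one chunk at a time across the workers.
result = umis_per_cell.compute()
```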
The overall procedure (listed values are parameters that can be changed):
- Aggregate all processed h5 files into one Zarr array, which is stored on disk
- Filter to cells with ≥500 UMIs and <1% of UMIs mapping to mitochondrial genes
- Iterative clustering, starting with the full array (a simplified sketch of one iteration follows this list):
  - Gene selection via a binomial approximation of the sequencing process; genes whose observed frequency is >5% below expectation are selected
  - Square-root transformation of the raw counts
    - log1p is also possible, as well as scaling and PCA (using Dask-ML), but these steps were not used for our analysis
  - kNN construction using pynndescent, with cosine distance
  - SNN computed from the kNN graph, implemented with numba to run in parallel
  - Leiden clustering over a range of resolutions, using the leidenalg package; we choose the first resolution that yields a non-trivial clustering (i.e. more than one cluster)
- Subcluster the results until either (a) no further clusters are found or, more commonly, (b) the resulting clusters have no DE genes between them
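The real implementation lives in this repository and runs over chunked Zarr arrays; the following is a compressed, dense in-memory sketch of a single iteration. The gene-selection cutoff, neighbor count, and resolution grid are illustrative, and the SNN step is elided (the kNN edges are used directly):

```python
import numpy as np
import igraph as ig
import leidenalg
from pynndescent import NNDescent

def cluster_once(counts, resolutions=(1e-5, 1e-4, 1e-3, 1e-2, 1e-1)):
    """One simplified iteration of the clustering loop (dense numpy input)."""
    # Gene selection via a binomial/Poisson approximation of sequencing:
    # compare each gene's observed detection rate to the rate expected
    # from its overall abundance, and keep genes detected >5% less often
    # than expected (a rough stand-in for the repository's criterion).
    gene_totals = counts.sum(axis=0)
    cell_totals = counts.sum(axis=1)
    p = gene_totals / gene_totals.sum()
    # E[detected in cell i] ~ 1 - exp(-n_i * p_g); average over cells.
    # Dense for clarity; the real pipeline computes this in chunks.
    expected = 1.0 - np.exp(np.outer(-cell_totals, p)).mean(axis=0)
    observed = (counts > 0).mean(axis=0)
    selected = observed < 0.95 * expected

    # Square-root transform of the raw counts for the selected genes.
    x = np.sqrt(counts[:, selected])

    # Approximate kNN with cosine distance.
    knn, _ = NNDescent(x, metric="cosine", n_neighbors=25).neighbor_graph

    # Build a graph from the kNN edges; the real pipeline converts the
    # kNN graph to a shared-nearest-neighbor (SNN) graph first.
    edges = [(i, j) for i, row in enumerate(knn) for j in row if i != j]
    graph = ig.Graph(n=x.shape[0], edges=edges).simplify()

    # Scan resolutions and keep the first non-trivial partition.
    for res in resolutions:
        part = leidenalg.find_partition(
            graph,
            leidenalg.RBConfigurationVertexPartition,
            resolution_parameter=res,
        )
        if len(set(part.membership)) > 1:
            return np.array(part.membership)
    return np.zeros(x.shape[0], dtype=int)  # no non-trivial clustering found
```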
The clustering is stored in a nested directory structure, with a helper class defined in the `leiden_tree` module.
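As a hypothetical illustration of what such a layout looks like (the actual naming is defined in `leiden_tree`, not here), each node of the cluster tree could map to a nested path named by its integer labels:

```python
from pathlib import Path

# Hypothetical layout: each cluster node is a directory, named by its
# integer label and nested under its parent node.
root = Path("clusters")
node = root / "0" / "1" / "0" / "2"  # would correspond to the query `0 1 0 2`
print(node)  # clusters/0/1/0/2
```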
The best way to install this package is in a new conda environment. The `environment.yaml` file makes this simple:
```sh
git clone https://github.com/MacoskoLab/brain-atlas.git  # clone this repository
cd brain-atlas
conda env create -f environment.yaml  # creates an environment named `atlas`
conda activate atlas
```
You can specify an alternative name for the environment with the option `-n [ENV_NAME]`.
If you're installing the package on a GCP instance, your credentials should already be configured. Otherwise you may need to authenticate:
```sh
gcloud init
gcloud auth login
gcloud auth application-default login  # not sure if this is needed
# ... follow instructions
```
The `atlas query` command will subsample a subtree from the full clustering and save the count data and metadata as a compressed numpy array (`.npz`) and a `csv` file, respectively. The interface looks like this:
```
➜ atlas query
2022-01-20 15:44:17 - brain_atlas.scripts.__init__ - INFO - No Dask cluster found
2022-01-20 15:44:17 - brain_atlas.scripts.__init__ - INFO - Starting cluster on local machine
Usage: atlas query [OPTIONS] [QUERY]...

  Extracts QUERY from a cluster array, subsampling the clusters if needed,
  and creates an npy file containing the counts along with a csv of metadata.
  Can be run on a local zarr directory, or download data from GCS

Options:
  --data-path TEXT        Path to count array  [required]
  --cluster-path TEXT     Path to cluster array  [required]
  --metadata-path TEXT    Path to cell metadata  [required]
  --output-dir DIRECTORY  Directory for output  [required]
  --subsample INTEGER     Subsample the data per cluster  [default: 5000]
  --max-cells INTEGER     Max number of cells to return  [default: 50000]
  --help                  Show this message and exit.
```
`data-path` and `cluster-path` are the paths to Zarr arrays containing the counts and cluster labels, respectively. These can be local directories or locations on Google Cloud Storage. `metadata-path` is similar, but should point to a Parquet dataframe with metadata for the cells.
All three of these arrays will be subset to descendants of the requested `QUERY`, which is a series of integers that defines a subset of the cluster tree, e.g. `0 1 0 2` to select everything beyond that node in the tree. The results are written to `output-dir`.
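Conceptually, selecting the descendants of a query is a prefix match on each cell's cluster path. A minimal sketch, assuming the labels are stored as an `(n_cells, depth)` integer array (this layout is an assumption for illustration, not necessarily the on-disk format):

```python
import numpy as np

def descendant_mask(cluster_labels, query):
    """Boolean mask of cells whose cluster path starts with `query`.

    Assumes `cluster_labels` is an (n_cells, depth) integer array;
    this representation is hypothetical.
    """
    query = np.asarray(query)
    return (cluster_labels[:, : len(query)] == query).all(axis=1)

# e.g. select everything under node 0 1 0 2:
# mask = descendant_mask(labels, [0, 1, 0, 2])
```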
`subsample` will subsample each leaf cluster to at most the specified number of cells. If the result would be more than `max-cells`, it will not be downloaded; this is a precaution to avoid accidentally downloading very large arrays. To download it anyway, simply increase the limit.
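For example, an invocation might look like `atlas query 0 1 0 2 --data-path counts.zarr --cluster-path clusters.zarr --metadata-path metadata.parquet --output-dir query-out` (all paths here are hypothetical). Once downloaded, the results can be read back with numpy and pandas; the file names and array key below are illustrative, as the actual names are chosen by the command:

```python
import numpy as np
import pandas as pd

# Hypothetical output names; `atlas query` determines the actual files.
counts = np.load("query-out/counts.npz")["counts"]  # compressed count matrix
metadata = pd.read_csv("query-out/metadata.csv")    # per-cell metadata
```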