This repo contains files, scripts, and analysis related to exploring integration of single-cell and single-nuclei data.
Package dependencies for the analysis workflows in this repository are managed using `renv`.
For `renv` to work as intended, you'll need to work within the `sc-data-integration.Rproj` project in RStudio.
You may need to run `renv::restore()` upon opening the project to ensure the `renv.lock` file is synced with the project library.
Each time you install or use new packages, you will want to run `renv::snapshot()` to update the `renv.lock` file with any added packages and dependencies necessary to run the analyses and scripts in this repo.
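For reference, a typical `renv` session inside the project might look like the following sketch:

```r
# Sketch of the renv workflow described above, run from within sc-data-integration.Rproj
renv::restore()   # sync the project library with renv.lock after opening the project

# ... install or start using new packages as you work ...

renv::snapshot()  # record any new packages and dependencies back into renv.lock
```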
If there are dependencies you want to include that are not captured automatically by `renv::snapshot()`, add them to `components/dependencies.R` with a call to `library()` and an explanatory comment.
For example, if `dplyr` were recommended but not required by a package and you wanted to make sure to include it in the lockfile, you would add `library(dplyr)` to `components/dependencies.R`, then rerun `renv::snapshot()`.
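Concretely, the `dplyr` example above would amount to adding a line like this to `components/dependencies.R` (the comment wording is just illustrative):

```r
# dplyr is recommended but not required by a package we use, so load it here
# to make sure renv::snapshot() captures it in renv.lock
library(dplyr)
```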
The main workflow for the integration scripts is written with Snakemake, which will handle most dependencies internally, including the `renv` environment.
You will need the latest version of `snakemake` and the `peppy` python package.
The easiest way to install these is with `conda` and/or `mamba`, which you will want to set up to use the `bioconda` and `conda-forge` channels using the following series of commands:

```sh
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict
```

You can then install `snakemake` and `peppy` into your preferred environment with:

```sh
mamba install snakemake peppy
```

(Use `conda install` if you do not have `mamba` installed.)
Python-based environments will be built automatically by Snakemake when the workflow is run, but the environment for R should be built before running the workflow.
To create or update the necessary environment for the R scripts, which includes an isolated version of R, `pandoc`, and the `renv` package installation, run the following command from the base of the repository:

```sh
bash setup_envs.sh
```

This script will use Snakemake to install all necessary components for the workflow in an isolated conda environment.
If you are on an Apple Silicon (M1/M2/Arm) Mac, this should properly handle setting up R to use an Intel-based build for compatibility with Bioconductor packages.
This installation may take up to an hour, as all of the R packages will likely have to be compiled from scratch.
However, this should be a one-time cost, and it ensures that you have all of the tools for the workflow installed and ready.
To use the environment you have just created, you will need to run Snakemake with the `--use-conda` flag each time.
If there are updates to the `renv.lock` file only, those can be applied with the following command (on any system):

```sh
snakemake --use-conda -c2 build_renv
```
For exploring data integration, we used test datasets obtained from the Human Cell Atlas (HCA) Data Portal, the Single-cell Pediatric Cancer Atlas (ScPCA) Portal, and simulated single-cell data published in Luecken et al. (2022).
All data from the HCA that we are using can be found in the private S3 bucket, `s3://sc-data-integration/human_cell_atlas_data`.
The simulated data can be downloaded directly from figshare.
All gene expression data used for benchmarking is stored in the private S3 bucket, `s3://sc-data-integration`.
In order to access these files, you must be a Data Lab staff member and have credentials set up for AWS.
Inside the `sample-info` folder is metadata related to the datasets used for testing data integration.

`<project_name>-project-metadata.tsv`: This file contains information about each of the projects that are being used for testing integration from a given area (e.g., HCA, simulated, ScPCA).
Each row in this file corresponds to a project, dataset, or group of libraries that should be integrated together.
All `project-metadata.tsv` files must contain a `project_name` column, but may also contain other relevant project information such as the following:
| column_id | contents |
|---|---|
| `project_name` | The shorthand project name |
| `source_id` | Unique ID associated with the project |
| `tissue_group` | Tissue group the project belongs to (e.g. blood, brain, or kidney) |
| `files_directory` | Files directory on S3 |
| `metadata_filename` | Name of the metadata file obtained from the HCA |
| `celltype_filename` | Name of the file containing cell type information as found on the HCA |
| `celltype_filetype` | Format of the cell type file available on the HCA |
`hca-library-metadata.tsv`: This file contains information about each library from every project that is being used as a test dataset from the HCA.
Each row in this file corresponds to a library and contains the following columns:
| column_id | contents |
|---|---|
| `sample_biomaterial_id` | Unique ID associated with the individual sample |
| `library_biomaterial_id` | Unique ID associated with the individual library that was sequenced |
| `bundle_uuid` | UUID for the individual folder containing each loom file |
| `project_name` | The shorthand project name assigned by the HCA |
| `source_id` | Unique ID associated with the project |
| `tissue_group` | Tissue group the project belongs to (e.g. blood, brain, or kidney) |
| `technology` | Sequencing/library technology used (e.g. 10Xv2, 10Xv3) |
| `seq_unit` | Sequencing unit (cell or nucleus) |
| `diagnosis` | Indicates if the sample came from diseased or normal tissue |
| `organ` | Tissue specified by the HCA where the sample was obtained from |
| `organ_part` | Tissue region specified by the HCA where the sample was obtained from |
| `selected_cell_types` | Identifies the group of cells selected for prior to sequencing, otherwise NA |
| `s3_files_dir` | Files directory on S3 |
| `loom_file` | Loom file name in the format `tissue_group/project_name/bundle_uuid/filename` |
`<project_name>-processed-libraries.tsv`: This file contains the list of libraries from each project that are being used for testing data integration.
This metadata file is required for most scripts, including running `scpca-downstream-analyses` using `01-run-downstream-analyses.sh` and running the integration workflow.
This file must contain the following columns, but may also contain additional columns related to a given dataset:
| column_id | contents |
|---|---|
| `sample_biomaterial_id` | Unique ID associated with the individual sample |
| `library_biomaterial_id` | Unique ID associated with the individual library that was sequenced |
| `project_name` | The shorthand project name |
| `integration_input_dir` | The directory containing the `SingleCellExperiment` objects to be used as input to the data integration snakemake workflow |
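As a rough illustration (not one of the repository's scripts), reading one of these files in R and checking that the required columns are present might look like the sketch below; the `sample-info/hca-processed-libraries.tsv` path is an assumed example:

```r
# Hypothetical check of a <project_name>-processed-libraries.tsv file
processed_libraries <- read.delim(
  "sample-info/hca-processed-libraries.tsv",  # example path; adjust per project
  stringsAsFactors = FALSE
)

required_columns <- c(
  "sample_biomaterial_id",
  "library_biomaterial_id",
  "project_name",
  "integration_input_dir"
)
stopifnot(all(required_columns %in% colnames(processed_libraries)))
```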
`hca-celltype-info.tsv`: This file is not available in the repo and is stored in the private S3 bucket, `s3://sc-data-integration/sample-info`.
This file contains all available cell type information for projects listed in `hca-project-metadata.tsv`.
This file was created using the `scripts/00a-reformat-celltype-info.R` script, which takes as input the cell type information available for each project from the Human Cell Atlas Data Portal.
The cell type information for each project, in its original format, can be found in `s3://sc-data-integration/human_cell_atlas_data/celltype`.
Each row corresponds to a single cell and contains the following information:
| column_id | contents |
|---|---|
| `sample_biomaterial_id` | Unique ID associated with the individual sample |
| `library_biomaterial_id` | Unique ID associated with the individual library that was sequenced |
| `project` | The shorthand project name assigned by the HCA |
| `barcode` | The unique cell barcode |
| `celltype` | The assigned cell type for a given barcode, obtained from cell type data stored in `s3://sc-data-integration/human_cell_atlas_data/celltype` |
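As a hypothetical example of how this table lines up with a processed object, cell type labels could be matched to one library's cells by barcode roughly as follows (assuming `hca-celltype-info.tsv` has been copied locally and that the `SingleCellExperiment` column names are cell barcodes; the paths and library ID are placeholders):

```r
library(SingleCellExperiment)

celltype_info <- read.delim("sample-info/hca-celltype-info.tsv")  # placeholder local path
sce <- readRDS("path/to/processed_sce.rds")                       # placeholder path

# keep only the rows for this library, then match cell type labels by barcode
library_celltypes <- celltype_info[
  celltype_info$library_biomaterial_id == "example_library_id",   # placeholder ID
]
colData(sce)$celltype <-
  library_celltypes$celltype[match(colnames(sce), library_celltypes$barcode)]
```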
All data and intermediate files are stored in the private S3 bucket, `s3://sc-data-integration`.
The following data can be found in the above S3 bucket within the `human_cell_atlas_data` folder:
- The `loom` folder contains the original loom files downloaded from the Human Cell Atlas data portal for each test dataset. Here loom files are nested by `tissue_group`, `project_name`, and `bundle_uuid`.
- The `sce` folder contains the unfiltered `SingleCellExperiment` objects saved as RDS files. These `SingleCellExperiment` objects have been converted from the loom files using the `00-obtain-sce.R` script in the `scripts` directory in this repo. Here RDS files are nested by `tissue_group` and `project_name`.
The following data can be found in the S3 bucket within the `scib_simulated_data` folder:

- The `hdf5` folder contains the original `hdf5` files for simulated data obtained from figshare.
- The `sce` folder contains the individual `SingleCellExperiment` objects stored as `rds` files after running `scripts/00b-obtain-sim-sce.R` and `scripts/00c-create-sim1-subsets.R`.
A separate `reference-files` folder contains any reference files needed for processing datasets, such as the GTF file needed to generate the mitochondrial gene list found in the `reference-files` folder in the repository.
In order to access these files, you must be a Data Lab staff member and have credentials set up for AWS. Additionally, some of the scripts in this repository require use of AWS command line tools. We have previously written up detailed instructions on installing the AWS command line tools and configuring your credentials that can be used as a reference.
After AWS command line tools have been set up, the `SingleCellExperiment` objects found in `s3://sc-data-integration/human_cell_atlas_data/sce` can be copied to your local computer by running the `00-obtain-sce.R` script with the `--copy_s3` flag:

```sh
Rscript scripts/00-obtain-sce.R --copy_s3
```

This will copy any `SingleCellExperiment` objects for libraries listed in `hca-processed-libraries.tsv` that have already been converted from loom files.
If any libraries listed in `hca-processed-libraries.tsv` do not have corresponding `SingleCellExperiment` objects, running `00-obtain-sce.R` will also convert those loom files.
The `human_cell_atlas_results/scpca-downstream-analyses` folder contains all processed `SingleCellExperiment` objects and the output from running the core workflow in `scpca-downstream-analyses`.
Within this folder, each library that has been processed has its own folder that contains both the processed `SingleCellExperiment` object and an HTML summary report.
The `SingleCellExperiment` objects in this folder have been both filtered to remove empty droplets and run through `scpca-downstream-analyses` using the `scripts/01-run-downstream-analyses.sh` script.
This means they contain a `logcounts` assay with the normalized counts matrix, both PCA and UMAP embeddings, and clustering assignments that can be found in the `louvain_10` column of the `colData`.
The `SingleCellExperiment` objects present in `human_cell_atlas_results/scpca-downstream-analyses` should be the objects used as input for integration methods.
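For orientation, the pieces described above can be pulled out of one of these processed objects roughly as follows (the RDS path is a placeholder):

```r
library(SingleCellExperiment)

sce <- readRDS("path/to/processed_library_sce.rds")  # placeholder path

logcounts(sce)                  # normalized counts matrix
reducedDimNames(sce)            # should include the PCA and UMAP embeddings
head(colData(sce)$louvain_10)   # clustering assignments
```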
These files were produced and synced to S3 using the following script:

Note: To run the below script, you must have R (v4.1.2), Snakemake, and `pandoc` available in your path.
`pandoc` must be version 1.12.3 or higher, which can be checked using the `pandoc -v` command.
```sh
bash scripts/01-run-downstream-analyses.sh \
  --downstream_repo <full path to scpca-downstream-analyses-repo> \
  --s3_bucket "s3://sc-data-integration/human_cell_atlas_results/scpca-downstream-analyses"
```
Note: If you wish to run ScPCA data through the integration workflow, rather than HCA data, please see the special guidelines for preparing ScPCA data.
To run the integration workflow, invoke `snakemake` from the `sc-data-integration` directory with the following command:

```sh
snakemake -c4 --use-conda
```

You can adjust the number of cores used by changing the number in the `-c4` flag (here, 4).
Note that you will want to have set up the R conda environment already, especially if you are on an Apple Silicon Mac.
To run the workflow for development, you may wish to specify the `config-test.yaml` file, which will run only one project through the pipeline to save time:

```sh
snakemake -c4 --use-conda --configfile config-test.yaml
```
Finally, to run the `scib_simulated` data through the pipeline, use:

```sh
snakemake -c4 --use-conda --configfile config-scib_simulated.yaml
```