Analyses for integrating ENCODE single-nucleus ATAC-Seq liver datasets.
Contact information: Austin Wang [email protected]
## Requirements

- A Linux-based OS
- A conda-based Python 3 installation
- Snakemake v6.6.1+ (full installation)
- An ENCODE DCC account with access to the necessary datasets
## Usage

- Install the requirements listed above.
- Download the pipeline:

  ```
  git clone https://github.com/kundajelab/ENCODE_scatac
  ```

- Activate the `snakemake` conda environment:

  ```
  conda activate snakemake
  ```

- Run the pipeline:

  ```
  snakemake -k --use-conda --cores $NCORES
  ```

  Here, `$NCORES` is the number of cores to utilize.

When run for the first time, the pipeline will take some time to install conda packages.
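The CLI invocation above is the supported way to launch the pipeline. If you prefer to drive the run from a script, Snakemake 6.x also exposes a Python entry point; the sketch below is an assumption-laden illustration, and the `Snakefile` path should be adjusted if the workflow file lives elsewhere in the repository:

```python
# Sketch (assumption): run the workflow via the Snakemake 6.x Python API instead of the CLI.
# Adjust the Snakefile path if the workflow definition is not at the repository root.
import multiprocessing

import snakemake

success = snakemake.snakemake(
    "Snakefile",                        # workflow definition
    cores=multiprocessing.cpu_count(),  # equivalent to --cores $NCORES
    use_conda=True,                     # equivalent to --use-conda
    keepgoing=True,                     # equivalent to -k
)
print("Pipeline finished successfully" if success else "Pipeline failed")
```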
## Analyses

The pipeline performs the following analysis steps (a rough illustrative sketch follows the list):

- Cell filtering based on minimum fragment count and TSS enrichment
- Iterative LSI dimensionality reduction (ArchR)
- Louvain clustering
- Dataset integration with Harmony
- Cell type labeling guided by Guilliams et al., Cell 2022, in addition to the ENCODE liver RNA datasets
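Inside the pipeline these steps are implemented with ArchR and Harmony in Snakemake rules. Purely as an illustration of the LSI → clustering → integration flow, and not the pipeline's actual code, here is a minimal Python sketch using scikit-learn, scanpy (with its optional `louvain` dependency), and harmonypy; the input matrix, batch labels, and all parameter values are placeholders:

```python
# Illustrative sketch only -- the pipeline itself uses ArchR's iterative LSI, not this code.
# `counts` is a cells x peaks sparse count matrix; `batches` gives each cell's dataset of origin.
import pandas as pd
import anndata as ad
import scanpy as sc
import harmonypy
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import TruncatedSVD

def lsi_cluster_integrate(counts, batches, n_components=30, resolution=1.0):
    # 1) LSI: TF-IDF weighting followed by truncated SVD (a single pass; ArchR iterates this).
    tfidf = TfidfTransformer().fit_transform(counts)
    lsi = TruncatedSVD(n_components=n_components, random_state=0).fit_transform(tfidf)
    lsi = lsi[:, 1:]  # the first component often tracks sequencing depth and is commonly dropped

    # 2) Louvain clustering on a k-nearest-neighbor graph built from the LSI embedding.
    adata = ad.AnnData(X=lsi)
    sc.pp.neighbors(adata, use_rep="X")
    sc.tl.louvain(adata, resolution=resolution)

    # 3) Harmony: correct dataset-of-origin effects in the reduced space.
    meta = pd.DataFrame({"dataset": list(batches)})
    corrected = harmonypy.run_harmony(lsi, meta, ["dataset"]).Z_corr.T

    return adata.obs["louvain"].to_numpy(), corrected
```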
## Configuration and input data

All parameters can be found in `config/config.yaml`.

The pipeline will automatically pull the required datasets from the ENCODE portal. The relevant ENCODE IDs can be found in `config/samples_atac.tsv` and `config/samples_multiome.tsv`.

The pipeline will additionally download relevant non-ENCODE data files. URLs to these can be found in `config/config.yaml`.
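To see which ENCODE accessions a run will pull before launching it, you can read the sample sheets directly. The sketch below does not assume any particular column names; it simply prints the headers and any columns whose values look like ENCODE accessions:

```python
# Sketch: list the ENCODE accessions referenced by the sample sheets.
# No column names are assumed; inspect the printed headers to confirm the layout.
import pandas as pd

for sheet in ["config/samples_atac.tsv", "config/samples_multiome.tsv"]:
    samples = pd.read_csv(sheet, sep="\t")
    print(sheet, "columns:", list(samples.columns))
    for col in samples.columns:
        values = samples[col].astype(str)
        # Heuristic: ENCODE accessions start with "ENC".
        if values.str.startswith("ENC").any():
            print(col, "->", sorted(values.unique()))
```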
## Outputs

This pipeline contains code for both the RNA and ATAC liver data analyses; running it generates results for both. ATAC outputs are organized in the `export/atac` directory, relative to the current working directory:

```
atac
├── metadata.tsv.gz            # Cell metadata file
├── embeddings                 # Embedding coordinate files
│   └── $EMBEDDING.tsv.gz
├── labels                     # Cell type label files
│   └── $CELL_LABELS_SET.tsv.gz
├── markers                    # Directory for cell type object marker gene data
│   └── $LABEL_TEMP_ID.tsv.gz  # Cell type marker gene file
├── markers_aux                # Directory for cell type object auxiliary data
│   └── $LABEL_TEMP_ID.tar.gz  # Tarball for cell type auxiliary data
├── figures.tar.gz             # Tarball for figure data
├── auxiliary_data.tar.gz      # Tarball for auxiliary data
└── datasets.txt               # A list of datasets used (ENCODE IDs)
```

Note: intermediate pipeline outputs will also be placed in the current working directory.
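As a quick sanity check after a run, the exported tables can be loaded with pandas. The embedding file names depend on your run, so the sketch simply lists the `embeddings/` directory and reads the first file it finds:

```python
# Sketch: inspect the exported ATAC metadata and one embedding from export/atac/.
import os
import pandas as pd

export_dir = "export/atac"

metadata = pd.read_csv(os.path.join(export_dir, "metadata.tsv.gz"), sep="\t")
print(metadata.shape, list(metadata.columns))

embedding_files = os.listdir(os.path.join(export_dir, "embeddings"))
print("Available embeddings:", embedding_files)

if embedding_files:
    embedding = pd.read_csv(
        os.path.join(export_dir, "embeddings", embedding_files[0]), sep="\t"
    )
    print(embedding.head())
```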