This repository provides a Snakemake pipeline for analyzing single-cell ATAC-seq (scATAC-seq) data using several tools, including cellranger-atac, Socrates, Genrich, and bedtools. The pipeline processes raw scATAC-seq FASTQ files, cleans cell barcodes, demultiplexes cells, and identifies accessible chromatin regions (ACRs).
- Install Snakemake
- Required software tools: cellranger-atac, sinto, samtools, R, popscle, bedtools, and ncbi-blast
- Ensure access to SLURM for resource allocation on a high-performance computing (HPC) cluster.
For dependencies, create the Conda environments defined in config/SocratesEnv.yaml and config/DemuxletEnv.yaml:
conda env create -f config/SocratesEnv.yaml
conda env create -f config/DemuxletEnv.yaml
- CELLRANGER_PATH: path to cellranger-atac software
- CELLRANGER_ref: path to reference created with cellranger
- scATACraw: path to fastq of single-cell ATAC-seq reads
- NAME: Sample name of reference for cellranger-atac
- Nuclear: expression matching the scaffold prefix of the nuclear genome (used by sinto)
- Plastid: expression matching the scaffold prefix of the plastid genome
- GFF: gene annotation in gff format
- CHRFILE: lengths of scaffolds
- MACSpath: path to macs2 software
- VCFdemux: path to VCF for demuxlet
- SAMPLENAMES: file with atac-seq sample names
- PICARD: path to picard software
- GENRICH: path to genrich software
- SAMPLEMETA: "Metadata/SampleNames.txt"
- WGS: "/path/to/WGS_alignments"
- REFERENCE: "/path/to/reference_genome.fa"
- PLASTID: "/path/to/plastid_blast_db"
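Before launching a run, it can help to confirm that every key listed above is set. The following is a minimal sketch, assuming the configuration has already been loaded into a Python dict (for example, by Snakemake from a YAML config file). The helper name `missing_config_keys` and the example values are hypothetical and are not part of the pipeline.

```python
# Keys taken from the configuration list above.
REQUIRED_KEYS = [
    "CELLRANGER_PATH", "CELLRANGER_ref", "scATACraw", "NAME",
    "Nuclear", "Plastid", "GFF", "CHRFILE", "MACSpath", "VCFdemux",
    "SAMPLENAMES", "PICARD", "GENRICH", "SAMPLEMETA", "WGS",
    "REFERENCE", "PLASTID",
]


def missing_config_keys(config):
    """Return the required keys that are absent or empty in the config dict."""
    return [key for key in REQUIRED_KEYS if not config.get(key)]


# Hypothetical, deliberately incomplete config used only for illustration.
example_config = {"NAME": "sample1", "GFF": "annotation.gff"}
missing = missing_config_keys(example_config)
print("Missing keys:", ", ".join(missing))
```

A check like this could sit at the top of a Snakefile and raise an error when the returned list is non-empty, so that misconfigured runs fail before any cluster jobs are submitted.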
- Cell Ranger ATAC Processing
  - Description: Runs cellranger-atac to process raw scATAC fastq files and generate BAM and fragment files.
  - Input: Raw fastq files, reference genome.
  - Output: {NAME}/outs/possorted_bam.bam, {NAME}/outs/fragments.tsv
- Cell Barcode Cleaning
  - Description: Formats and filters cell barcodes using the sinto tool.
  - Input: BAM and fragment files from Cell Ranger ATAC output.
  - Output: Filtered fragments and contig lists.
- Cell Identification (Socrates)
  - Description: Uses Socrates to filter cell barcodes based on genomic regions.
  - Output: Filtered barcodes and sparse matrices for further analysis.
- Demultiplexing Cells
  - Description: Uses demuxlet with SNP data to demultiplex cells.
  - Input: BAM file, VCF file of SNPs, sample names.
  - Output: Best match demultiplex results, clean barcode list.
- Filtering Barcodes
  - Description: Filters barcodes using picard to remove duplicates and produce a final BAM.
  - Output: Filtered and indexed BAM files.
- Downsample Nuclei
  - Description: Randomly downsamples barcodes to a specified count (e.g., 552 per sample).
  - Input: List of clean barcodes.
  - Output: Downsampled barcode files.
- ACR Calling
  - Description: Uses Genrich to call Accessible Chromatin Regions (ACRs) based on downsampled BAM files.
  - Output: Peak files and ACR log files.
- ACR Processing
  - Description: Filters out regions aligning to plastid genomes and calculates Jaccard similarity.
  - Output: Processed and concatenated ACR files for frequency analysis.
- ACR Analysis and Visualization
  - Description: Analyzes and visualizes ACR positions relative to genes.
  - Output: Merged and classified ACR files, visualization plots.
- ACR Classification
  - Description: Classifies ACRs based on proximity to genes and exons.
  - Output: ACRs classified by gene location, overlap with exons, and nearest gene.
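The Downsample Nuclei step can be illustrated with a short sketch. It assumes the clean barcodes are available as a list of strings and draws a fixed-size random subset with Python's `random.sample`, using a fixed seed so the draw is reproducible. The function name, the seed, and the barcode strings are hypothetical; the pipeline's own implementation may differ.

```python
import random


def downsample_barcodes(barcodes, n, seed=0):
    """Pick n barcodes at random without replacement.

    If fewer than n unique barcodes are available, all of them are returned.
    """
    unique = sorted(set(barcodes))  # sorted so the result depends only on the seed
    if len(unique) <= n:
        return unique
    rng = random.Random(seed)  # local generator; global random state is untouched
    return sorted(rng.sample(unique, n))


# Hypothetical barcodes used only for illustration.
clean = [f"BC{i:04d}-1" for i in range(1000)]
subset = downsample_barcodes(clean, 552)
print(len(subset))  # 552
```

Downsampling every sample to the same number of nuclei keeps the ACR calls comparable across samples, since peak callers tend to report more regions when given more reads.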
The pipeline produces a variety of output files, including:
- Filtered BAM files: located in DEMUX/BAMscATAC/
- Downsampled ACR files: stored in DownsampledBams/Peaks/
- Merged ACRs: final classified ACR files for downstream analysis
- Visualization files: plots summarizing ACRs
- ACRs classified by genomic context: ACRs_classified/