The HuBMAP Consortium sc-atac-seq pipeline analyzes scATAC-seq data sets and is composed of ArchR and chromVAR. Source code can be found at https://github.com/hubmapconsortium/sc-atac-seq-pipeline
The pipeline performs quantification using a specified aligner; HuBMAP has standardized on BWA with the GRCh38 reference genome. ArchR divides the genome into non-overlapping bins of user-specified size (we use 500 bp), produces a FastQC analysis of the input fastq files, and produces a binary cell-by-bin matrix denoting whether reads in each cell were aligned to each bin.
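The binning step can be illustrated with a minimal Python sketch. This is not ArchR's implementation (ArchR is an R package); it only shows the idea of mapping aligned positions to fixed-width, non-overlapping bins and recording presence or absence per cell. The input tuples, the 0-based coordinates, and the function name are assumptions made for illustration.

```python
from collections import defaultdict

BIN_SIZE = 500  # bin width used by HuBMAP, assumed here to be in base pairs


def cell_by_bin(alignments, bin_size=BIN_SIZE):
    """Build a binary cell-by-bin structure.

    alignments: iterable of (barcode, sequence_name, position) tuples,
    with position assumed to be a 0-based coordinate (hypothetical layout).
    Returns a dict mapping each barcode to a sorted list of
    (sequence_name, bin_index) pairs. A pair being present means the
    matrix entry is 1; absent pairs are 0.
    """
    bins_per_cell = defaultdict(set)
    for barcode, sequence_name, position in alignments:
        # Integer division assigns every position to exactly one bin.
        bins_per_cell[barcode].add((sequence_name, position // bin_size))
    return {barcode: sorted(bins) for barcode, bins in bins_per_cell.items()}


# Made-up alignments, for illustration only.
example_alignments = [
    ("AAAC", "chr1", 10),
    ("AAAC", "chr1", 499),   # same bin as position 10
    ("AAAC", "chr1", 500),   # first position of the next bin
    ("TTTG", "chr2", 1234),
]
example_matrix = cell_by_bin(example_alignments)
print(example_matrix)
```

Because the matrix is binary, two reads from the same cell that fall in the same bin count only once.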
The ArchR secondary analysis pipeline filters bins based on TSS enrichment and fragment number, performs LSI dimensionality reduction, and selects peaks from all available bins. The chromVAR tool performs motif analysis, assigns motifs to transcription factors, and computes differential enrichment of transcription factors across cells in the data set.
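LSI is generally a TF-IDF weighting of the accessibility matrix followed by a truncated singular value decomposition. The sketch below shows only the weighting step, in plain Python, using one common TF-IDF variant. It is not ArchR's implementation and ArchR's exact weighting may differ; the function name and the input layout are hypothetical.

```python
import math


def tf_idf(cell_bins):
    """Weight a binary cell-by-bin structure with TF-IDF.

    cell_bins: dict mapping barcode -> set of bin identifiers that are
    accessible (value 1) in that cell.
    Returns a dict mapping barcode -> {bin identifier: weight}.
    """
    n_cells = len(cell_bins)

    # Document frequency: in how many cells each bin is accessible.
    doc_freq = {}
    for bins in cell_bins.values():
        for bin_id in bins:
            doc_freq[bin_id] = doc_freq.get(bin_id, 0) + 1

    weights = {}
    for barcode, bins in cell_bins.items():
        # Term frequency for binary data: each accessible bin gets an
        # equal share of the cell's total signal.
        tf = 1.0 / len(bins) if bins else 0.0
        weights[barcode] = {
            bin_id: tf * math.log(1 + n_cells / doc_freq[bin_id])
            for bin_id in bins
        }
    return weights


# Made-up input: bin 1 is open in both cells, bin 0 only in cell "c1".
example_weights = tf_idf({"c1": {0, 1}, "c2": {1}})
print(example_weights)
```

Bins that are accessible in few cells receive larger weights than bins accessible in most cells. The dimensionality reduction itself would then apply a truncated SVD to the weighted matrix using a linear algebra library.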
Running the pipeline requires a CWL workflow execution engine; we recommend cwltool, the reference implementation, which is written in Python. It can be installed in a sufficiently recent Python environment with pip install cwltool, after which the pipeline can be invoked as:
cwltool sc_atac_seq_prep_process_analyze.cwl sc_atac_seq_prep_process_analyze.json
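The second argument is a job file of workflow inputs in JSON. A minimal sketch of generating one in Python is given here, assuming that the workflow declares the sequence_directory, input_reference_genome, and reference_genome_index inputs described in this README as one CWL Directory and two CWL Files. The actual input types (for example, whether sequence_directory is a list of directories) should be checked against the .cwl file. All paths are placeholders and the helper names are hypothetical.

```python
import json


def build_job(sequence_directory, input_reference_genome,
              reference_genome_index=None):
    """Return a CWL input object as a plain dict.

    The Directory/File typing of each input is an assumption; confirm it
    against the workflow definition before use.
    """
    job = {
        "sequence_directory": {
            "class": "Directory",
            "path": str(sequence_directory),
        },
        "input_reference_genome": {
            "class": "File",
            "path": str(input_reference_genome),
        },
    }
    # The prebuilt BWA index is only added when a path is supplied.
    if reference_genome_index is not None:
        job["reference_genome_index"] = {
            "class": "File",
            "path": str(reference_genome_index),
        }
    return job


def cwltool_command(job_path,
                    workflow="sc_atac_seq_prep_process_analyze.cwl"):
    """Return the argument list for the invocation shown above."""
    return ["cwltool", workflow, str(job_path)]


# Placeholder paths, for illustration only.
example_job = build_job("/data/fastqs", "/ref/GRCh38.fa")
print(json.dumps(example_job, indent=2))
print(" ".join(cwltool_command("job.json")))
```

Writing the printed JSON to a file and passing that file to cwltool reproduces the invocation above with custom inputs.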
To build the Docker images, run
build_docker_containers
from the sc-atac-seq directory. The build could take up to an hour.
The HuBMAP sc-atac-seq pipeline uses the Genome Reference Consortium human genome, build 38 (GRCh38). A BWA-generated set of index files is required for the reference genome. Using an alternate reference or index is not currently supported without rebuilding the sc-atac-seq Docker container, though one can build an alternate container by modifying the Dockerfile.
- sequence_directory
A directory for the pipeline to search for fastq or fastq.gz files. The pipeline only works on paired-end reads and expects, for historical reasons, the paired-end read files to be named <some_name>*_R1*.fastq and <some_name>*_R3*.fastq. If a file containing barcodes, <some_name>*_R2*.fastq, is found, the barcodes will be read and added to the read IDs in the paired-end fastq files.
- input_reference_genome
A fasta file of the GRCh38 reference genome
- reference_genome_index
A .gz file containing the BWA-generated index of the GRCh38 reference genome, i.e. the ".bwt", ".sa", ".ann", ".pac", and ".amb" files generated by BWA indexing. If this file is provided, the pipeline does not have to generate the index, which saves some time.
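Before launching a run, it can help to check that the contents of sequence_directory follow the naming convention described above. The following Python sketch groups file names into R1/R3 read pairs with an optional R2 barcode file. The regular expression is one possible reading of the <some_name>*_R1*.fastq pattern and is an assumption; the pipeline's own matching logic may differ, and the function name is hypothetical.

```python
import re
from collections import defaultdict

# Assumed interpretation of the naming convention: everything before the
# first _R1, _R2, or _R3 plus everything after it (up to the extension)
# identifies the sample.
FASTQ_RE = re.compile(
    r"^(?P<name>.+?)_R(?P<read>[123])(?P<rest>.*)\.fastq(?:\.gz)?$"
)


def group_fastqs(file_names):
    """Group fastq file names into paired-end sets.

    Returns a dict mapping a sample key to {"R1": ..., "R3": ...} and,
    when a barcode file exists, an additional "R2" entry. Samples
    missing either R1 or R3 are dropped, since the pipeline only works
    on paired-end reads.
    """
    groups = defaultdict(dict)
    for file_name in sorted(file_names):
        match = FASTQ_RE.match(file_name)
        if match is None:
            continue  # not a fastq file following the convention
        key = match.group("name") + match.group("rest")
        groups[key]["R" + match.group("read")] = file_name
    return {
        key: reads
        for key, reads in groups.items()
        if "R1" in reads and "R3" in reads
    }


# Made-up file names, for illustration only.
example_files = [
    "s1_R1_001.fastq.gz",
    "s1_R2_001.fastq.gz",
    "s1_R3_001.fastq.gz",
    "s2_R1.fastq",   # no matching R3, so it is not reported as a pair
    "notes.txt",
]
example_pairs = group_fastqs(example_files)
print(example_pairs)
```

To check a real directory, the file names can be obtained with os.listdir or pathlib and passed to the same function.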
- Bins.csv
A CSV file providing sequence name and bin information
- cellBarcodes.CSV
A CSV file with barcode ID and barcode
- cellByBin_summary.csv
A CSV file with barcode ID and bin number
- cellClusterAssignment.csv
A CSV file with barcode ID and cluster number
- GenesRanges.csv
A CSV file providing sequence, gene name and gene location information
- cellByGene.mtx
A file with the cell by gene matrix in Matrix Market format
- cellGenes.csv
A CSV file with gene ID and gene name
- peaksAllCells.csv
A CSV file with sequence name and peak start and end
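The CSV outputs can be inspected with standard tooling. As a minimal sketch, the Python below counts cells per cluster from cellClusterAssignment.csv. It assumes that the file has a header row and that the columns appear in the order barcode ID, cluster number; neither detail is stated in this README, so both should be checked against an actual output file. The function name and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def cells_per_cluster(csv_file, has_header=True):
    """Count rows per cluster in a barcode-to-cluster CSV.

    csv_file: an open text file or file-like object.
    Assumes column 0 is the barcode ID and column 1 is the cluster
    number (an assumption, not confirmed by the README).
    """
    reader = csv.reader(csv_file)
    if has_header:
        next(reader, None)  # skip the assumed header row
    counts = Counter()
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        counts[row[1].strip()] += 1
    return counts


# Made-up content standing in for cellClusterAssignment.csv.
example_csv = io.StringIO("barcode_id,cluster\nBC1,1\nBC2,2\nBC3,1\n")
example_counts = cells_per_cluster(example_csv)
print(dict(example_counts))
```

For a real run, open the output file with open("cellClusterAssignment.csv", newline="") and pass the handle to the function.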