Download test data

Vivekanandan Ramalingam edited this page Mar 19, 2024 · 8 revisions

Experimental dataset

For this tutorial we'll use experimental ChIP-seq data for the transcription factor CTCF in the K562 cell line, available on the ENCODE data portal. ENCODE has 5 such experiments; you can see them listed under ChIP-seq CTCF K562. We'll restrict ourselves to one experiment, ENCSR000EGM.

Experiment bams

Download the .bam files for the two replicates of experiment ENCSR000EGM (CTCF in the K562 cell line) from the ENCODE data portal.

The two replicates are isogenic replicates (biological). A more detailed explanation of the various types of replicates can be found here.

Links to the replicate bam files are provided below.

ENCFF198CVB

ENCFF488CXC

wget https://www.encodeproject.org/files/ENCFF198CVB/@@download/ENCFF198CVB.bam -O rep1.bam
wget https://www.encodeproject.org/files/ENCFF488CXC/@@download/ENCFF488CXC.bam -O rep2.bam

Control bams

Now download the bam file from the control ENCSR000EHI for the experiment. The file is linked below:

ENCFF023NGN

wget https://www.encodeproject.org/files/ENCFF023NGN/@@download/ENCFF023NGN.bam -O control.bam
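Large downloads occasionally fail partway or return an error page instead of data, so it can be worth a quick check before moving on. The sketch below is not part of the original tutorial: `has_gzip_magic` is a hypothetical helper that only tests whether each file is non-empty and starts with the gzip magic bytes `1f 8b`, which every BGZF-compressed BAM does. It assumes the three files were saved under the names used above.

```shell
# Hypothetical helper (an assumption, not part of the tutorial): succeeds if
# the file exists, is non-empty, and begins with the gzip magic bytes 1f 8b.
has_gzip_magic() {
    [ -s "$1" ] || return 1
    [ "$(od -An -tx1 -N2 "$1" | tr -d ' \n')" = "1f8b" ]
}

# Check the two replicates and the control downloaded above.
for f in rep1.bam rep2.bam control.bam; do
    if has_gzip_magic "$f"; then
        echo "$f: looks compressed"
    else
        echo "$f: missing, empty, or not gzip-compressed" >&2
    fi
done
```

This is only a coarse test. Since samtools is used later on this page, `samtools quickcheck rep1.bam rep2.bam control.bam` is a more thorough option, as it also checks for a valid header and end-of-file block.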

Reference files and blacklist regions

Finally, download the reference files. In the example below, some preprocessing is required to filter out unwanted chromosomes from the hg38.chrom.sizes file. Additionally, the blacklist file shown is specific to hg38, and should be replaced with a genome-specific blacklist if alternative genomes are used.

Available Blacklists:

For those interested in using the blacklists, current versions for dm3, dm6, ce10, ce11, mm10, hg19, and hg38 are available in the lists/ folder at https://github.com/Boyle-Lab/Blacklist/

Please cite:

Amemiya, H.M., Kundaje, A. & Boyle, A.P. The ENCODE Blacklist: Identification of Problematic Regions of the Genome. Sci Rep 9, 9354 (2019). https://doi.org/10.1038/s41598-019-45839-z

# download genome reference
wget https://www.encodeproject.org/files/GRCh38_no_alt_analysis_set_GCA_000001405.15/@@download/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz \
-O hg38.genome.fa.gz

# decompress to hg38.genome.fa
gunzip hg38.genome.fa.gz

# index genome reference
samtools faidx hg38.genome.fa

# download chrom sizes
wget https://www.encodeproject.org/files/GRCh38_EBV.chrom.sizes/@@download/GRCh38_EBV.chrom.sizes.tsv

# exclude contigs with '_' in their name (e.g. random, unplaced, alt) and chrEBV
grep -v -e '_' -e 'chrEBV' GRCh38_EBV.chrom.sizes.tsv > hg38.chrom.sizes
rm GRCh38_EBV.chrom.sizes.tsv

# make file with chromosomes only
awk '{print $1}' hg38.chrom.sizes > chroms.txt

# download blacklist
wget https://www.encodeproject.org/files/ENCFF356LFX/@@download/ENCFF356LFX.bed.gz -O blacklist.bed.gz
gunzip blacklist.bed.gz
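To see what the chromosome filter above keeps and drops, here is a minimal sketch that runs the same `grep` and `awk` commands on a small made-up file. The `sample.*` file names and the five sample lines are assumptions for illustration only; the real input is GRCh38_EBV.chrom.sizes.tsv.

```shell
# Illustrative sample in the two-column chrom.sizes layout (name<TAB>length).
printf 'chr1\t248956422\nchr1_KI270706v1_random\t175055\nchrUn_KI270302v1\t2274\nchrM\t16569\nchrEBV\t171823\n' \
    > sample.chrom.sizes.tsv

# Same filter as above: drop any name containing '_' and drop chrEBV.
grep -v -e '_' -e 'chrEBV' sample.chrom.sizes.tsv > sample.filtered.chrom.sizes

# Keep the first column only, as is done for chroms.txt.
awk '{print $1}' sample.filtered.chrom.sizes > sample.chroms.txt
cat sample.chroms.txt
```

On this sample the last command prints chr1 and chrM. Note that the filter keeps chrM; if the mitochondrial chromosome is not wanted downstream, adding `-e 'chrM'` to the `grep` call would drop it as well.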