-
Notifications
You must be signed in to change notification settings - Fork 11
Tutorial
In this tutorial you will find instructions for
- SV2 execution
- Constructing a genotype matrix
- Merging divergent breakpoints
The version of SV2 used in this guide is v1.4.0
. You can find the source package under the Release tab
$ pip install sv2
Skip the commands below if you already have an indexed hg19 FASTA file
$ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz
$ gunzip human_g1k_v37.fasta.gz
# index FASTA
$ samtools faidx human_g1k_v37.fasta
# download required resource files
$ sv2 -download
# define fasta location
$ sv2 -hg19 /full/path/to/human_g1k_v37.fasta
$ wget https://github.com/dantaki/SV2/releases/download/sv2v1.4.0/sv2_tutorial.zip
$ unzip sv2_tutorial.zip
$ cd sv2_tutorial/
-
HG00096_sub.bam
- Down-sampled low coverage Illumina Paired-End WGS from the 1000 Genomes Project. Aligned with BWA-MEM to hg19.
-
1kgp_phase3_tutorial.vcf
- A random selection of deletions and duplications from the 1000 Genomes Project phase 3 release
-
HG00096_sub.vcf.gz
-
HG00096.ped
and1kgp.ped
-
sv2_tutorial_features/
- directory containing features from 26 high coverage genomes from the 1000 Genomes Project
-
lumpy.bed
andmanta.bed
$ sv2 -i HG00096_sub.bam \
-v 1kgp_phase3_tutorial.vcf \
-snv HG00096_sub.vcf.gz \
-p HG00096.ped \
-o HG00096_sv2
Note: This tutorial is merely an example. The BAM file has been down-sampled to low depths (<1x) inadequate for genotyping.
Preprocessing output located in sv2_preprocessing/HG00096_sv2_preprocessing.txt
Features located in sv2_features/HG00096_sv2_features.txt
Genotypes located in sv2_genotypes/HG00096_sv2.vcf
Here's the STDOUT for the command above on my device
$ sv2 -i HG00096_sub.bam -v 1kgp_phase3_tutorial.vcf -snv HG00096_sub.vcf.gz -p HG00096.ped -o HG00096_sv2
sv2 version:1.4.0 report bugs to <dantaki at ucsd dot edu> error messages located in sv2_tutorial/sv2.err
************************************************
PREPROCESSING COMPLETE <time elapsed: 0:01:00>
************************************************
*****************************************************
FEATURE EXTRACTION COMPLETE <time elapsed: 0:01:10>
*****************************************************
*********************************************
GENOTYPING COMPLETE <time elapsed: 0:01:22>
*********************************************
Refer to the Guidelines for recommendations for creating a genotype matrix.
Squaring off the genotype matrix requires features from multiple samples, resulting in a VCF of genotypes for each sample.
The files in sv2_tutorial_features
are features from a selection of SV calls from the 1000 Genomes phase 3 call set produced from 26 samples from the 1000 Genomes.
$ sv2 -feats sv2_tutorial_features/ -v 1kgp_phase3_tutorial.vcf -p 1kgp.ped -o 1kgp_sv2
This step does not require BAM files, but concatenates features from previous SV2 runs.
Take a look at the output
$ less -S sv2_genotypes/1kgp_sv2.vcf
#CHROM | POS | ID | REF | ALT | QUAL | FILTER | INFO | FORMAT | HG00096 | HG00268 | HG00419 | HG00759 | HG01051 | HG01112 | HG01500 | HG01565 | HG01583 | HG01595 | HG01879 | HG02568 | HG02922 | HG03006 | HG03052 | HG03642 | HG03742 | NA12878 | NA18525 | NA18939 | NA19017 | NA19238 | NA19239 | NA19625 | NA19648 | NA20845 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 14437074 | UW_VH_13213 | . | 30 | PASS | END=14438948;SVTYPE=DEL;SVLEN=1875;DENOVO_FILTER=PASS;REF_GTL=18;AF=0.346;CYTOBAND=1p36.21;REPEATMASKER=NA,0;1000G_ID=UW_VH_13213;1000G_OVERLAP=1.00;DESCRIPTION=1:14437074-14438948_deletion;GENES=intergenic;ABPARTS=0;CENTROMERE=0;GAP=0;SEGDUP=0;STR=0;UNMAPPABLE=0.003 | GT:CN:PE:SR:SC:NS:HA:NH:SQ:GL | 0/0:2.13:0.00:0.00:nan:0:nan:0:19:0.99,0.01,0.00 | 0/0:1.73:0.00:0.00:0.76:1:0.75:1:17:0.98,0.02,0.00 | 1/1:0.39:0.28:0.25:0.08:15:nan:0:19:0.01,0.00,0.99 | 0/1:1.03:0.14:0.09:0.43:3:0.60:3:33:0.00,0.98,0.01 | 0/1:1.17:0.15:0.09:0.47:6:0.44:4:36:0.00,0.98,0.02 | 0/0:1.89:0.00:0.00:0.22:1:nan:0:20:0.99,0.01,0.00 | 1/1:0.22:0.46:0.36:0.07:15:nan:0:20:0.01,0.00,0.99 | 0/1:0.98:0.09:0.07:0.72:6:0.23:6:31:0.00,0.99,0.01 | 0/0:2.19:0.00:0.00:nan:0:nan:0:18:0.98,0.02,0.00 | 0/1:1.21:0.13:0.11:0.38:2:0.50:2:34:0.00,0.98,0.02 | 0/0:1.86:0.00:0.00:1.07:2:nan:0:20:0.99,0.01,0.00 | 0/1:1.42:0.13:0.08:0.69:8:0.44:7:22:0.01,0.95,0.04 | 0/1:0.85:0.08:0.04:0.45:4:0.86:2:26:0.00,0.97,0.03 | 0/0:1.90:0.00:0.00:nan:0:nan:0:20:0.99,0.01,0.00 | 0/1:0.95:0.09:0.07:0.36:6:0.66:6:30:0.00,0.99,0.01 | 1/1:0.37:0.40:0.22:0.36:3:nan:0:20:0.01,0.00,0.99 | 0/0:1.88:0.00:0.00:nan:0:nan:0:20:0.99,0.01,0.00 | 0/0:2.13:0.00:0.00:0.34:1:nan:0:18:0.99,0.01,0.00 | 1/1:0.31:0.56:0.34:0.15:15:nan:0:19:0.01,0.00,0.99 | 0/0:1.72:0.00:0.00:nan:0:nan:0:17:0.98,0.02,0.00 | 0/0:1.88:0.00:0.00:nan:0:nan:0:20:0.99,0.01,0.00 | 0/1:0.99:0.10:0.08:0.52:2:0.68:2:31:0.00,0.99,0.01 | 0/0:1.72:0.00:0.00:0.61:3:0.95:3:17:0.98,0.02,0.00 | 0/0:1.55:0.00:0.00:nan:0:nan:0:8:0.84,0.16,0.00 | 0/1:1.20:0.07:0.07:0.32:3:0.86:3:31:0.00,0.99,0.01 | 0/1:1.19:0.08:0.05:0.16:3:1.00:3:32:0.00,0.99,0.01 |
Merge SV calls from LUMPY and Manta.
Genotypes for this section are merely examples. The alignment file is down-sampled to depths (<1x) inadequate for genotyping.
Note: we pass the pre sv2_preprocessing
option to skip preprocessing.
# default merging
$ sv2 -i HG00096_sub.bam \
-b lumpy.bed manta.bed \
-snv HG00096_sub.vcf.gz \
-p HG00096.ped \
-o HG00096_merged \
-merge \
-pre sv2_preprocessing/
By default, breakpoints with 80% reciprocal overlap or greater to each other are merged. The variant with the highest median ALT genotype likelihood is retained.
# merge breakpoints with 95% reciprocal overlap
$ sv2 -i HG00096_sub.bam \
-b lumpy.bed manta.bed \
-snv HG00096_sub.vcf.gz \
-p HG00096.ped \
-o HG00096_95merged \
-min-ovr 0.95
-pre sv2_preprocessing/ -feats sv2_features/
Note: we pass the -feats sv2_features/
option to skip feature extraction
Quick check of the output
# print number of variants
$ grep -P "^chr1" sv2_genotypes/HG00096_merged.vcf | wc -l
352
$ grep -P "^chr1" sv2_genotypes/HG00096_95merged.vcf | wc -l
361
We have more variants merged with the less stringent 80% overlap requirement.
We recommend users to genotype SVs for one sample at a time. Multiple samples can be combined at a later time, without the need for BAM files.
After running SV2 for each sample, place all the sv2_features/
output into one location (all_features/
). Then create a BED or VCF file containing all unique SV calls $ cat *bed | sort -k1,1d -k2,2n | uniq >all-svs.bed
.
Then run this command to square off the genotype matrix
$ sv2 -b all-svs.bed -feats all_features/ -p my.ped -o joint_sv2_genotypes
all_features/
contains SV2 feature output from all samples.
my.ped
is a PED file will all the samples to merge