Skip to content

Tutorial

Danny Antaki edited this page Mar 8, 2018 · 31 revisions

Synopsis

In this tutorial you will find instructions for

  • SV2 execution
  • Constructing a genotype matrix
  • Merging divergent breakpoints

The version of SV2 used in this guide is v1.4.0. You can find the source package under the Release tab


Tutorial Steps

  1. Install SV2
  2. Download Tutorial Files
  3. Run SV2
  4. Make a Genotype Matrix
  5. Merge divergent breakpoints

1. Install SV2

$ pip install sv2

1.1 Obtain hg19 FASTA file

Skip the commands below if you already have an indexed hg19 FASTA file

$ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz
$ gunzip human_g1k_v37.fasta.gz

# index FASTA
$ samtools faidx human_g1k_v37.fasta

1.2 Configure SV2

# download required resource files
$ sv2 -download

# define fasta location
$ sv2 -hg19 /full/path/to/human_g1k_v37.fasta

2. Download Tutorial Files

$ wget https://github.com/dantaki/SV2/releases/download/sv2v1.4.0/sv2_tutorial.zip
$ unzip sv2_tutorial.zip
$ cd sv2_tutorial/

Tutorial Files

  • HG00096_sub.bam

    • Down-sampled low coverage Illumina Paired-End WGS from the 1000 Genomes Project. Aligned with BWA-MEM to hg19.
  • 1kgp_phase3_tutorial.vcf

    • A random selection of deletions and duplications from the 1000 Genomes Project phase 3 release
  • HG00096_sub.vcf.gz

    • Variants were called with FreeBayes using the down-sampled BAM file. The VCF file was compressed with bgzip and indexed with tabix
  • HG00096.ped and 1kgp.ped

  • sv2_tutorial_features/

    • directory containing features from 26 high coverage genomes from the 1000 Genomes Project
  • lumpy.bed and manta.bed

    • A selection of LUMPY and Manta predictions to merge

3. Run SV2

$ sv2 -i HG00096_sub.bam \
      -v 1kgp_phase3_tutorial.vcf \
      -snv HG00096_sub.vcf.gz \
      -p HG00096.ped \
      -o HG00096_sv2

Note: This tutorial is merely an example. The BAM file has been down-sampled to low depths (<1x) inadequate for genotyping.

Preprocessing output located in sv2_preprocessing/HG00096_sv2_preprocessing.txt

Features located in sv2_features/HG00096_sv2_features.txt

Genotypes located in sv2_genotypes/HG00096_sv2.vcf

Here's the STDOUT for the command above on my device

$ sv2 -i HG00096_sub.bam -v 1kgp_phase3_tutorial.vcf -snv HG00096_sub.vcf.gz -p HG00096.ped -o HG00096_sv2

    sv2 version:1.4.0    report bugs to <dantaki at ucsd dot edu>       error messages located in sv2_tutorial/sv2.err
    ************************************************
     PREPROCESSING COMPLETE <time elapsed: 0:01:00>
    ************************************************
    *****************************************************
     FEATURE EXTRACTION COMPLETE <time elapsed: 0:01:10>
    *****************************************************
    *********************************************
     GENOTYPING COMPLETE <time elapsed: 0:01:22>
    *********************************************

4. Genotype Matrix

Refer to the Guidelines for recommendations for creating a genotype matrix.

Squaring off the genotype matrix requires features from multiple samples, resulting in a VCF of genotypes for each sample.

The files in sv2_tutorial_features are features from a selection of SV calls from the 1000 Genomes phase 3 call set produced from 26 samples from the 1000 Genomes.

$ sv2 -feats sv2_tutorial_features/ -v 1kgp_phase3_tutorial.vcf -p 1kgp.ped -o 1kgp_sv2

This step does not require BAM files, but concatenates features from previous SV2 runs.

Take a look at the output

$ less -S sv2_genotypes/1kgp_sv2.vcf
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG00096 HG00268 HG00419 HG00759 HG01051 HG01112 HG01500 HG01565 HG01583 HG01595 HG01879 HG02568 HG02922 HG03006 HG03052 HG03642 HG03742 NA12878 NA18525 NA18939 NA19017 NA19238 NA19239 NA19625 NA19648 NA20845
1 14437074 UW_VH_13213 . 30 PASS END=14438948;SVTYPE=DEL;SVLEN=1875;DENOVO_FILTER=PASS;REF_GTL=18;AF=0.346;CYTOBAND=1p36.21;REPEATMASKER=NA,0;1000G_ID=UW_VH_13213;1000G_OVERLAP=1.00;DESCRIPTION=1:14437074-14438948_deletion;GENES=intergenic;ABPARTS=0;CENTROMERE=0;GAP=0;SEGDUP=0;STR=0;UNMAPPABLE=0.003 GT:CN:PE:SR:SC:NS:HA:NH:SQ:GL 0/0:2.13:0.00:0.00:nan:0:nan:0:19:0.99,0.01,0.00 0/0:1.73:0.00:0.00:0.76:1:0.75:1:17:0.98,0.02,0.00 1/1:0.39:0.28:0.25:0.08:15:nan:0:19:0.01,0.00,0.99 0/1:1.03:0.14:0.09:0.43:3:0.60:3:33:0.00,0.98,0.01 0/1:1.17:0.15:0.09:0.47:6:0.44:4:36:0.00,0.98,0.02 0/0:1.89:0.00:0.00:0.22:1:nan:0:20:0.99,0.01,0.00 1/1:0.22:0.46:0.36:0.07:15:nan:0:20:0.01,0.00,0.99 0/1:0.98:0.09:0.07:0.72:6:0.23:6:31:0.00,0.99,0.01 0/0:2.19:0.00:0.00:nan:0:nan:0:18:0.98,0.02,0.00 0/1:1.21:0.13:0.11:0.38:2:0.50:2:34:0.00,0.98,0.02 0/0:1.86:0.00:0.00:1.07:2:nan:0:20:0.99,0.01,0.00 0/1:1.42:0.13:0.08:0.69:8:0.44:7:22:0.01,0.95,0.04 0/1:0.85:0.08:0.04:0.45:4:0.86:2:26:0.00,0.97,0.03 0/0:1.90:0.00:0.00:nan:0:nan:0:20:0.99,0.01,0.00 0/1:0.95:0.09:0.07:0.36:6:0.66:6:30:0.00,0.99,0.01 1/1:0.37:0.40:0.22:0.36:3:nan:0:20:0.01,0.00,0.99 0/0:1.88:0.00:0.00:nan:0:nan:0:20:0.99,0.01,0.00 0/0:2.13:0.00:0.00:0.34:1:nan:0:18:0.99,0.01,0.00 1/1:0.31:0.56:0.34:0.15:15:nan:0:19:0.01,0.00,0.99 0/0:1.72:0.00:0.00:nan:0:nan:0:17:0.98,0.02,0.00 0/0:1.88:0.00:0.00:nan:0:nan:0:20:0.99,0.01,0.00 0/1:0.99:0.10:0.08:0.52:2:0.68:2:31:0.00,0.99,0.01 0/0:1.72:0.00:0.00:0.61:3:0.95:3:17:0.98,0.02,0.00 0/0:1.55:0.00:0.00:nan:0:nan:0:8:0.84,0.16,0.00 0/1:1.20:0.07:0.07:0.32:3:0.86:3:31:0.00,0.99,0.01 0/1:1.19:0.08:0.05:0.16:3:1.00:3:32:0.00,0.99,0.01

5. Merge Divergent Breakpoints

Merge SV calls from LUMPY and Manta.

Genotypes for this section are merely examples. The alignment file is down-sampled to depths (<1x) inadequate for genotyping.

Note: we pass the pre sv2_preprocessing option to skip preprocessing.

# default merging 

$ sv2 -i HG00096_sub.bam \
      -b lumpy.bed manta.bed \
      -snv HG00096_sub.vcf.gz \
      -p HG00096.ped \
      -o HG00096_merged \
      -merge \
      -pre sv2_preprocessing/

By default, breakpoints with 80% reciprocal overlap or greater to each other are merged. The variant with the highest median ALT genotype likelihood is retained.

# merge breakpoints with 95% reciprocal overlap

$ sv2 -i HG00096_sub.bam \
      -b lumpy.bed manta.bed \
      -snv HG00096_sub.vcf.gz \
      -p HG00096.ped \
      -o HG00096_95merged \
      -min-ovr 0.95 
      -pre sv2_preprocessing/ -feats sv2_features/

Note: we pass the -feats sv2_features/ option to skip feature extraction

Quick check of the output

# print number of variants 

$ grep -P "^chr1" sv2_genotypes/HG00096_merged.vcf | wc -l
    352
$ grep -P "^chr1" sv2_genotypes/HG00096_95merged.vcf | wc -l
    361

We have more variants merged with the less stringent 80% overlap requirement.

SV2 Guidelines

We recommend users to genotype SVs for one sample at a time. Multiple samples can be combined at a later time, without the need for BAM files.

After running SV2 for each sample, place all the sv2_features/ output into one location (all_features/). Then create a BED or VCF file containing all unique SV calls $ cat *bed | sort -k1,1d -k2,2n | uniq >all-svs.bed.

Then run this command to square off the genotype matrix

$ sv2 -b all-svs.bed -feats all_features/ -p my.ped -o joint_sv2_genotypes

all_features/ contains SV2 feature output from all samples.

my.ped is a PED file will all the samples to merge

Clone this wiki locally