Tutorial

Synopsis

In this tutorial you will find instructions for

SV² execution
Constructing a genotype matrix
Merging divergent breakpoints

The version of SV² used in this guide is v1.4.0. You can find the source package under the Release tab

Tutorial Steps

SV² Guidelines

1. Install SV²

$ pip install sv2

1.1 Obtain hg19 FASTA file

Skip the commands below if you already have an indexed hg19 FASTA file

$ wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz
$ gunzip human_g1k_v37.fasta.gz

# index FASTA
$ samtools faidx human_g1k_v37.fasta

1.2 Configure SV²

# download required resource files
$ sv2 -download

# define fasta location
$ sv2 -hg19 /full/path/to/human_g1k_v37.fasta

2. Download Tutorial Files

$ wget https://github.com/dantaki/SV2/releases/download/sv2v1.4.0/sv2_tutorial.zip
$ unzip sv2_tutorial.zip
$ cd sv2_tutorial/

Tutorial Files

HG00096_sub.bam
- Down-sampled low coverage Illumina Paired-End WGS from the 1000 Genomes Project. Aligned with BWA-MEM to hg19.
1kgp_phase3_tutorial.vcf
- A random selection of deletions and duplications from the 1000 Genomes Project phase 3 release
HG00096_sub.vcf.gz
- Variants were called with FreeBayes using the down-sampled BAM file. The VCF file was compressed with bgzip and indexed with tabix
HG00096.ped and 1kgp.ped
- plink PED file
sv2_tutorial_features/
- directory containing features from 26 high coverage genomes from the 1000 Genomes Project
lumpy.bed and manta.bed
- A selection of LUMPY and Manta predictions to merge

3. Run SV²

$ sv2 -i HG00096_sub.bam \
      -v 1kgp_phase3_tutorial.vcf \
      -snv HG00096_sub.vcf.gz \
      -p HG00096.ped \
      -o HG00096_sv2

Note: This tutorial is merely an example. The BAM file has been down-sampled to low depths (<1x) inadequate for genotyping.

Preprocessing output located in sv2_preprocessing/HG00096_sv2_preprocessing.txt

Features located in sv2_features/HG00096_sv2_features.txt

Genotypes located in sv2_genotypes/HG00096_sv2.vcf

Here's the STDOUT for the command above on my device

$ sv2 -i HG00096_sub.bam -v 1kgp_phase3_tutorial.vcf -snv HG00096_sub.vcf.gz -p HG00096.ped -o HG00096_sv2

    sv2 version:1.4.0    report bugs to <dantaki at ucsd dot edu>       error messages located in sv2_tutorial/sv2.err
    ************************************************
     PREPROCESSING COMPLETE <time elapsed: 0:01:00>
    ************************************************
    *****************************************************
     FEATURE EXTRACTION COMPLETE <time elapsed: 0:01:10>
    *****************************************************
    *********************************************
     GENOTYPING COMPLETE <time elapsed: 0:01:22>
    *********************************************

4. Genotype Matrix

Refer to the Guidelines for recommendations for creating a genotype matrix.

Squaring off the genotype matrix requires features from multiple samples, resulting in a VCF of genotypes for each sample.

The files in sv2_tutorial_features are features from a selection of SV calls from the 1000 Genomes phase 3 call set produced from 26 samples from the 1000 Genomes.

$ sv2 -feats sv2_tutorial_features/ -v 1kgp_phase3_tutorial.vcf -p 1kgp.ped -o 1kgp_sv2

This step does not require BAM files, but concatenates features from previous SV² runs.

Take a look at the output

$ less -S sv2_genotypes/1kgp_sv2.vcf

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	HG00096	HG00268	HG00419	HG00759	HG01051	HG01112	HG01500	HG01565	HG01583	HG01595	HG01879	HG02568	HG02922	HG03006	HG03052	HG03642	HG03742	NA12878	NA18525	NA18939	NA19017	NA19238	NA19239	NA19625	NA19648	NA20845
1	14437074	UW_VH_13213	.		30	PASS	END=14438948;SVTYPE=DEL;SVLEN=1875;DENOVO_FILTER=PASS;REF_GTL=18;AF=0.346;CYTOBAND=1p36.21;REPEATMASKER=NA,0;1000G_ID=UW_VH_13213;1000G_OVERLAP=1.00;DESCRIPTION=1:14437074-14438948_deletion;GENES=intergenic;ABPARTS=0;CENTROMERE=0;GAP=0;SEGDUP=0;STR=0;UNMAPPABLE=0.003	GT:CN:PE:SR:SC:NS:HA:NH:SQ:GL	0/0:2.13:0.00:0.00:nan:0:nan:0:19:0.99,0.01,0.00	0/0:1.73:0.00:0.00:0.76:1:0.75:1:17:0.98,0.02,0.00	1/1:0.39:0.28:0.25:0.08:15:nan:0:19:0.01,0.00,0.99	0/1:1.03:0.14:0.09:0.43:3:0.60:3:33:0.00,0.98,0.01	0/1:1.17:0.15:0.09:0.47:6:0.44:4:36:0.00,0.98,0.02	0/0:1.89:0.00:0.00:0.22:1:nan:0:20:0.99,0.01,0.00	1/1:0.22:0.46:0.36:0.07:15:nan:0:20:0.01,0.00,0.99	0/1:0.98:0.09:0.07:0.72:6:0.23:6:31:0.00,0.99,0.01	0/0:2.19:0.00:0.00:nan:0:nan:0:18:0.98,0.02,0.00	0/1:1.21:0.13:0.11:0.38:2:0.50:2:34:0.00,0.98,0.02	0/0:1.86:0.00:0.00:1.07:2:nan:0:20:0.99,0.01,0.00	0/1:1.42:0.13:0.08:0.69:8:0.44:7:22:0.01,0.95,0.04	0/1:0.85:0.08:0.04:0.45:4:0.86:2:26:0.00,0.97,0.03	0/0:1.90:0.00:0.00:nan:0:nan:0:20:0.99,0.01,0.00	0/1:0.95:0.09:0.07:0.36:6:0.66:6:30:0.00,0.99,0.01	1/1:0.37:0.40:0.22:0.36:3:nan:0:20:0.01,0.00,0.99	0/0:1.88:0.00:0.00:nan:0:nan:0:20:0.99,0.01,0.00	0/0:2.13:0.00:0.00:0.34:1:nan:0:18:0.99,0.01,0.00	1/1:0.31:0.56:0.34:0.15:15:nan:0:19:0.01,0.00,0.99	0/0:1.72:0.00:0.00:nan:0:nan:0:17:0.98,0.02,0.00	0/0:1.88:0.00:0.00:nan:0:nan:0:20:0.99,0.01,0.00	0/1:0.99:0.10:0.08:0.52:2:0.68:2:31:0.00,0.99,0.01	0/0:1.72:0.00:0.00:0.61:3:0.95:3:17:0.98,0.02,0.00	0/0:1.55:0.00:0.00:nan:0:nan:0:8:0.84,0.16,0.00	0/1:1.20:0.07:0.07:0.32:3:0.86:3:31:0.00,0.99,0.01	0/1:1.19:0.08:0.05:0.16:3:1.00:3:32:0.00,0.99,0.01

5. Merge Divergent Breakpoints

Merge SV calls from LUMPY and Manta.

Genotypes for this section are merely examples. The alignment file is down-sampled to depths (<1x) inadequate for genotyping.

Note: we pass the pre sv2_preprocessing option to skip preprocessing.

# default merging 

$ sv2 -i HG00096_sub.bam \
      -b lumpy.bed manta.bed \
      -snv HG00096_sub.vcf.gz \
      -p HG00096.ped \
      -o HG00096_merged \
      -merge \
      -pre sv2_preprocessing/

By default, breakpoints with 80% reciprocal overlap or greater to each other are merged. The variant with the highest median ALT genotype likelihood is retained.

# merge breakpoints with 95% reciprocal overlap

$ sv2 -i HG00096_sub.bam \
      -b lumpy.bed manta.bed \
      -snv HG00096_sub.vcf.gz \
      -p HG00096.ped \
      -o HG00096_95merged \
      -min-ovr 0.95 
      -pre sv2_preprocessing/ -feats sv2_features/

Note: we pass the -feats sv2_features/ option to skip feature extraction

Quick check of the output

# print number of variants 

$ grep -P "^chr1" sv2_genotypes/HG00096_merged.vcf | wc -l
    352
$ grep -P "^chr1" sv2_genotypes/HG00096_95merged.vcf | wc -l
    361

We have more variants merged with the less stringent 80% overlap requirement.

SV² Guidelines

We recommend users to genotype SVs for one sample at a time. Multiple samples can be combined at a later time, without the need for BAM files.

After running SV² for each sample, place all the sv2_features/ output into one location (all_features/). Then create a BED or VCF file containing all unique SV calls $ cat *bed | sort -k1,1d -k2,2n | uniq >all-svs.bed.

Then run this command to square off the genotype matrix

$ sv2 -b all-svs.bed -feats all_features/ -p my.ped -o joint_sv2_genotypes

all_features/ contains SV² feature output from all samples.

my.ped is a PED file will all the samples to merge

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tutorial

Synopsis

Tutorial Steps

1. Install SV²

1.1 Obtain hg19 FASTA file

1.2 Configure SV²

2. Download Tutorial Files

Tutorial Files

3. Run SV²

4. Genotype Matrix

5. Merge Divergent Breakpoints

SV² Guidelines

Clone this wiki locally

Tutorial

Synopsis

Tutorial Steps

1. Install SV2

1.1 Obtain hg19 FASTA file

1.2 Configure SV2

2. Download Tutorial Files

Tutorial Files

3. Run SV2

4. Genotype Matrix

5. Merge Divergent Breakpoints

SV2 Guidelines

Clone this wiki locally

1. Install SV²

1.2 Configure SV²

3. Run SV²

SV² Guidelines