SVJedi-graph is a structural variation (SV) genotyper for long read data. It takes as input a variant file (VCF), a reference genome (fasta) and a long read file (fasta/fastq) and outputs the initial variant file with an additional column containing genotyping information (VCF).
SVJedi-graph is based on a representation of the genome and the different SV alleles in a variation graph. After building this variation graph from the reference genome sequence and the input variant file, long reads are mapped on this graph using minigraph [1]. The genotype of each variant in a given individual sample is then estimated from allele-specific alignment counts.
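To illustrate how allele-specific alignment counts can be turned into a genotype, here is a minimal sketch using a simple binomial likelihood. It is not taken from the SVJedi-graph code: the function name, the `error_rate` value and the exact likelihood model are assumptions made for illustration only.

```python
from math import comb

def genotype_from_counts(ref_count, alt_count, error_rate=0.05, min_support=3):
    """Pick the most likely genotype from allele-specific alignment counts.

    Hypothetical helper: error_rate and the binomial model are assumptions,
    not the values or the model used by SVJedi-graph itself.
    """
    total = ref_count + alt_count
    if total < min_support:
        return "./."  # too few informative alignments to call a genotype

    # Probability that one informative alignment supports the ALT allele,
    # under each candidate genotype.
    p_alt = {"0/0": error_rate, "0/1": 0.5, "1/1": 1.0 - error_rate}

    likelihoods = {
        gt: comb(total, alt_count) * p ** alt_count * (1.0 - p) ** ref_count
        for gt, p in p_alt.items()
    }
    return max(likelihoods, key=likelihoods.get)
```

With this sketch, ten alignments that all support the reference allele give `0/0`, an even split gives `0/1`, and fewer than `min_support` alignments leave the variant ungenotyped.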
Currently, SVJedi-graph can genotype five types of SVs: deletions, insertions, duplications, inversions and translocations (intra- and inter-chromosomal).
SVJedi-graph requires:
- Python (3.8.13 or higher)
- minigraph
conda install -c bioconda svjedi-graph
git clone https://gitlab.inria.fr/sromain/svjedi-graph.git
./svjedi-graph.py -v <inputVCF> -r <refFA> -q <longreadsFQ> [ -p <output_prefix> -t <threads> -ms <minsupport> ]
For all variants, the SVTYPE tag must be present in the INFO field (SVTYPE=DEL, SVTYPE=INS, SVTYPE=INV or SVTYPE=BND). Insertions need to be sequence-resolved, with the full inserted sequence reported in the ALT field of the VCF file. As duplications are a special case of insertions, SVJedi-graph also supports duplications, as long as their duplicated sequence is characterized and reported in the same way as for insertions. More details are given in SV representation in VCF.
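The requirements above can be checked before running the genotyper. The following is a minimal sketch of such a pre-check, assuming a standard tab-separated VCF data line; `check_record` is a hypothetical helper and is not part of SVJedi-graph.

```python
SUPPORTED_SVTYPES = {"DEL", "INS", "INV", "BND"}

def check_record(line):
    """Return a list of problems found in one tab-separated VCF data line.

    Hypothetical pre-check, not part of SVJedi-graph.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        return ["fewer than 8 columns"]

    alt, info = fields[4], fields[7]
    # INFO is a ';'-separated list of KEY=VALUE pairs (flags without '=' are skipped).
    tags = dict(item.split("=", 1) for item in info.split(";") if "=" in item)

    problems = []
    svtype = tags.get("SVTYPE")
    if svtype is None:
        problems.append("missing SVTYPE in INFO")
    elif svtype not in SUPPORTED_SVTYPES:
        problems.append(f"unsupported SVTYPE={svtype}")
    elif svtype == "INS" and (not alt or set(alt.upper()) - set("ACGTN")):
        # Symbolic alleles such as <INS> are not sequence-resolved.
        problems.append("insertion ALT is not a resolved sequence")
    return problems
```

An empty list means the record passed these checks; anything else describes what to fix in the input VCF.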
To check that SVJedi-graph behaves as expected on your device, you can run:
cd test-dir/
./run_test.sh
To explore the output files on a small dataset, run:
mkdir outputfiles
cd outputfiles
./../svjedi-graph.py -v ../test-dir/test.vcf -r ../test-dir/reference_genome.fasta -q ../test-dir/simulated_reads.fastq.gz -p test
-v, --vcf: VCF file containing the set of SVs to genotype.
-r, --ref: FASTA file containing the reference genome (on which the SVs have been identified).
-q, --reads: FASTQ file containing the long reads used for genotyping. If you have multiple FASTQ files for one individual, use a comma (,) as the filename separator.
-p, --prefix: Prefix of output files.
-t, --threads: Number of threads to use for the mapping step.
-ms, --minsupport: Minimum number of alignments required to genotype an SV (default: 3).
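When SVJedi-graph is called from a script, the options above can be assembled programmatically. This is a minimal sketch: `build_command` is a hypothetical helper, and the default script path assumes you are in the cloned repository. Only the flags listed above are used.

```python
def build_command(vcf, ref, reads, prefix=None, threads=None, min_support=None,
                  script="./svjedi-graph.py"):
    """Build the argument list for an SVJedi-graph run (hypothetical helper)."""
    # Several read files for one individual are passed as one comma-separated value.
    read_files = [reads] if isinstance(reads, str) else list(reads)
    cmd = [script, "-v", vcf, "-r", ref, "-q", ",".join(read_files)]
    if prefix is not None:
        cmd += ["-p", prefix]
    if threads is not None:
        cmd += ["-t", str(threads)]
    if min_support is not None:
        cmd += ["-ms", str(min_support)]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to launch the run.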
Main output file:
- <prefix>_genotype.vcf: Genotyped SV set in VCF format.
Intermediate output files:
- <prefix>.gfa: Variation graph in GFA format.
- <prefix>.gaf: Mapping results from minigraph in GAF format.
- <prefix>_informative_aln.json: JSON dictionary of read supports for each input SV's alleles.
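As an example of post-processing the main output, here is a minimal sketch that collects the genotype of each SV from the genotyped VCF. It assumes the standard VCF layout, with the FORMAT keys in column 9 and the sample values in column 10, and that a GT key is present. These are assumptions about the output rather than a documented guarantee, and `read_genotypes` is a hypothetical helper.

```python
def read_genotypes(vcf_lines):
    """Map (chrom, pos, id) to the GT value of the first sample column.

    Hypothetical helper; assumes FORMAT in column 9 and the sample in column 10.
    """
    genotypes = {}
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # no sample column, nothing to read
        keys = fields[8].split(":")
        values = fields[9].split(":")
        if "GT" in keys:
            genotypes[(fields[0], int(fields[1]), fields[2])] = values[keys.index("GT")]
    return genotypes
```

For a run with prefix `test`, it could be used as `with open("test_genotype.vcf") as handle: calls = read_genotypes(handle)`.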
The following information is required for SVJedi-graph to genotype each SV type. All variants must have the CHROM and POS fields defined, and chromosome names must be identical in the reference genome file and the variant file. The SVTYPE tag must be present in the INFO field (SVTYPE=DEL, SVTYPE=INS, SVTYPE=INV or SVTYPE=BND). Additional information is then required depending on the SV type:
- Deletion
  - INFO field must contain SVTYPE=DEL
  - INFO field must contain END=pos (with pos being the end position of the deleted segment)
- Insertion
  - INFO field must contain SVTYPE=INS
  - ALT field must contain the sequence of the insertion
- Duplication
  - must be defined as an insertion event, with CHROM and POS corresponding to the insertion position of the novel copy
  - INFO field must contain SVTYPE=INS
  - ALT field must contain the sequence of the duplication
- Inversion
  - INFO field must contain SVTYPE=INV
  - INFO field must contain the END=pos tag, with pos being the second breakpoint position
- Intra-chromosomal translocation
  - INFO field must contain SVTYPE=BND
  - ALT field must be formatted as t[pos[, t]pos], ]pos]t or [pos[t, with pos indicating the second breakpoint position and the bracket directions indicating which parts of the two chromosomes should be joined together
Sandra Romain, Claire Lemaitre, SVJedi-graph: improving the genotyping of close and overlapping structural variants with long reads using a variation graph, Bioinformatics, Volume 39, Issue Supplement_1, June 2023, Pages i270–i278, https://doi.org/10.1093/bioinformatics/btad237
SVJedi-graph is a Genscale tool developed by Sandra Romain and Claire Lemaitre. For any bug report or feedback, please use the GitHub Issues form.
Footnotes
1. Li, H., Feng, X. & Chu, C. The design and construction of reference pangenome graphs with minigraph. Genome Biol 21, 265 (2020). https://doi.org/10.1186/s13059-020-02168-z