Double entries from merge command #16

Thatguy027 · 2018-03-14T16:40:48Z

Hi there,

I just tested this tool for combining the outputs of Delly, Manta, and TIDDIT. It seems to work great!

I'm not necessarily sure this is an issue and you probably already know -

Here is a command I ran to merge three VCFs for one sample :

svdb --merge --pass_only --no_var --vcf ${dellysv}:delly ${mantasv}:manta ${tidditsv}:tiddit --priority delly,manta,tiddit

Note that I added --pass_only and --no_var after I noticed that when two softwares identify different variants at the same position two lines are output. I'm not sure that I noticed any difference when these flags were there vs when they weren't. From what I can tell, duplicate position lines break downstream processing with bcftools... errors on first occurrence of duplicate record. I haven't tested what would happen if I ran bcftools norm -m +any on the svdb --merge output prior to additional processing, because that is typically used for multi-allelic sites from what I understand.

I got around the issue with a "take first line if the next line has the same CHROM POS" awk command, but wanted to let you know. Maybe, the --priority flag can have that built in if the user requests it?

Another minor thing I noticed is that my chromosome order got rearranged after the --merge command - From I,II,III,IV,V,X,MtDNA to I,II,II,IV,MtDNA,V,X, I am assuming this happens because the output sorts the chromosomes by alpha-numeric?

Apart from that, this tool seems great! I'll let you know if I notice anything else.

Best,
S

The text was updated successfully, but these errors were encountered:

J35P312 · 2018-03-15T11:22:30Z

Hello there!
That's a tricky issue, it usually happens if the start position of the SV is the same, but the end position differs greatly. I have an example here (cnvnator calls from two individuals):

1 24257001 CNVnator_del_34 N . PASS END=24363000;SVTYPE=DEL;SVLEN=-106000
1 24257001 CNVnator_del_18 N ~~. PASS END=24319000;SVTYPE=DEL;SVLEN=-62000;~~

These two will not be merged, because their END position differs too much. You may avoid a few of these cases by changing the --overlap and --bnd_distance parameters, but there will always be a risk of getting a few of these.
Some --priority flag option sounds like a good idea, I will think about it!
In essence we would need some nice way of representing these SV as multiallelic sites.

I just made a fix to SVDB, if your vcf files have "contig entries" in the vcf-header, the chromosomes should be sorted correctly!
You are correct, otherwise SVDB will sort the chromosomes by lexiographic order.

Thanks! Happy to hear that the tool works for you! It would be great to know if you find any bugs or ideas of new features!
Regards
//Jesper

dnil · 2019-11-20T13:05:39Z

Bump on this one due to a recent MIP issue, although the use case was somewhat unexpected. I could perhaps phrase this as a question: should the docs perhaps state that SVDB merge require input to be split (and normalised)? I have a feeling it won't really handle many-way merging of multi-allelic input entries confidently?

sjurug · 2020-04-24T07:06:32Z

Hi, @J35P312

We also experience a similar problem

We observed that overlapping variants are not merged when trying to merge sequence resolved and non-sequence resolved variants from Truvari.

I think that this is what happens

Truvari removes the END info field for sequence resolved DEL variants
In svdb/readVCF.py, this leads to END being estimated as posB=posA+int(description["SVLEN"]), but for DEL case SVLEN is negative
Later, this leads to posA and posB being interchanged.
The new variant is no longer overlapping the original variant

Intuitively, I would change posB=posA+int(description["SVLEN"]) -> posB=posA+abs(int(description["SVLEN"])), but maybe you could have a look at it?
Maybe also add support for the cases when sequence resolved variants have neither END nor SVLEN?

J35P312 · 2020-04-24T09:36:48Z

Woopsiedaisy!
And thanks for your suggestions!

could you send me the variants you are trying to merge?
Then I will look in to this!

//JEsper

sjurug · 2020-04-24T10:45:39Z

Thank you for looking into this!

I have tried to make a minimal test set consisting of

Manta-style sequence resolved variant without END
Delly-style non-sequence resolved variant with END
Result from SVDB merge

I have been able to get the expected overlap by either

adding END to Manta
enforcing a positive SVLEN for Manta
svdb_16_testdata.tar.gz

J35P312 · 2020-04-27T08:46:31Z

Hello!
I have had a look, thanks for the test data!
I notice you are correct, I have edited the readvcf function to this:

posB=posA+abs(int(description["SVLEN"]))

and now the variants merge nicely!

1 95690510 MantaDEL:manta_95690510|DEL00000389:delly_95690510 AAAATAAATATAATTGGCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGCGGGCGGATCACGAGGTCAGGAGATCGAGACCATCCCGGCTAAAAAACGGTGAAACCCCGTCTCTACTAAAAATACAAAAAATTAGCCGGGCGTAGTGGCGGGCGCCTGTAGTCCCAGCTACTCGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCGAGATCCCGCCACTGCACTCCAGCCTGGGCGACAGAGCGAGACTCCGTCTCAAAAAAAAAAAAAAAAATAAAT A994 PASS SVTYPE=DEL;SVLEN=-319;END=95690830;VARID=DEL00000389:delly_95690510;set=Intersection GT 1/1

Feel free to give this a try!
//JEsper

sjurug · 2020-04-28T06:57:59Z

Thank you!

sjurug · 2021-08-16T12:07:51Z

Hi again,

It seems that also INS is affected by a similar issue

two INS with different POS will not merge despite equal lengths
I believe that it is caused by SVDB defaulting to posB=END internally instead of defaulting to posB=POS+abs(SVLEN) in readVCF.py
If using posB=POS+abs(SVLEN) as default (and posB=END if there is no SVLEN), the variants will merge
I belive that the origin of this problem is that END is a coordinate in the reference, while SVLEN is measured in some other coordinate system. Internally in SVDB this distinction does not matter, but when exporting the database-VCF, we get END=POS+abs(SVLEN), which does not work for INS. Again this does not matter when using SVDB for querying, but even after the suggested fix it could potentially lead to some confusion for people investigating the VCF manually.

Attached are tree INSes that only merge if I use the suggested fix. I have tested this

merging the variants
making a VCF-database of two INSes
querying the third INS using the VCF-database

svdb_16_testdata_ins.tar.gz

J35P312 · 2021-09-01T14:52:04Z

Hello!
Sorry for the delay!
I have had a look, and implemented separate treatment of the INS variants. now PosB=PosA (i.e all ins will have lenght 0 in the reference); in essence, the insertions will be treated as single points. I also added a new parameter "--ins_distance" allowing the insertions to be clustered with a different distance than bnd_distance.

these changes are found in the branch "insertion_distance". Feel free to try it!

To me this solves the issue of END=POS+abs(SVLEN), insertions not being merged properly etc in a good enough way.
Ideally, we should merge based on sequence similarity, but this is tricky since different callers report insertions in such different formats.

according to these rules, all the insertions in the test data get merged.

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG002 HG003 HG004
1 95690510 MantaINS:test_ins1|MantaINS:test_ins2|MantaINS:test_ins3 A AAAATAAATATAATTGGCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGCGGGCGGATCACGAGGTCAGGAGATCGAGACCATCCCGGCTAAAAAACGGTGAAACCCCGTCTCTACTAAAAATACAAAAAATTAGCCGGGCGTAGTGGCGGGCGCCTGTAGTCCCAGCTACTCGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCGAGATCCCGCCACTGCACTCCAGCCTGGGCGACAGAGCGAGACTCCGTCTCAAAAAAAAAAAAAAAAATAAAT 994 PASS END=95690510;SVTYPE=INS;SVLEN=319;VARID=MantaINS:test_ins2|MantaINS:test_ins3;set=Intersection GT 1/1 1/1 1/1

Do you agree with these rules? Feel free to comment! I will do some tests on mobile element insertions etc, probably I will merge the insertion_distance by the end of the week!

sjurug · 2021-09-09T07:08:36Z

Thanks!
I think that this solution is a good first approximation. If we are really afraid of over-estimating the frequency, we can use a very small value such as --ins_distance=0.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Double entries from merge command #16

Double entries from merge command #16

Thatguy027 commented Mar 14, 2018

J35P312 commented Mar 15, 2018

dnil commented Nov 20, 2019

sjurug commented Apr 24, 2020

J35P312 commented Apr 24, 2020

sjurug commented Apr 24, 2020

J35P312 commented Apr 27, 2020

sjurug commented Apr 28, 2020

sjurug commented Aug 16, 2021

J35P312 commented Sep 1, 2021

sjurug commented Sep 9, 2021

Double entries from merge command #16

Double entries from merge command #16

Comments

Thatguy027 commented Mar 14, 2018

J35P312 commented Mar 15, 2018

dnil commented Nov 20, 2019

sjurug commented Apr 24, 2020

J35P312 commented Apr 24, 2020

sjurug commented Apr 24, 2020

J35P312 commented Apr 27, 2020

sjurug commented Apr 28, 2020

sjurug commented Aug 16, 2021

J35P312 commented Sep 1, 2021

sjurug commented Sep 9, 2021