Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Double entries from merge command #16

Open
Thatguy027 opened this issue Mar 14, 2018 · 10 comments
Open

Double entries from merge command #16

Thatguy027 opened this issue Mar 14, 2018 · 10 comments

Comments

@Thatguy027
Copy link

Hi there,

I just tested this tool for combining the outputs of Delly, Manta, and TIDDIT. It seems to work great!

I'm not necessarily sure this is an issue and you probably already know -

Here is a command I ran to merge three VCFs for one sample :

svdb --merge --pass_only --no_var --vcf ${dellysv}:delly ${mantasv}:manta ${tidditsv}:tiddit --priority delly,manta,tiddit

Note that I added --pass_only and --no_var after I noticed that when two softwares identify different variants at the same position two lines are output. I'm not sure that I noticed any difference when these flags were there vs when they weren't. From what I can tell, duplicate position lines break downstream processing with bcftools... errors on first occurrence of duplicate record. I haven't tested what would happen if I ran bcftools norm -m +any on the svdb --merge output prior to additional processing, because that is typically used for multi-allelic sites from what I understand.

I got around the issue with a "take first line if the next line has the same CHROM POS" awk command, but wanted to let you know. Maybe, the --priority flag can have that built in if the user requests it?

Another minor thing I noticed is that my chromosome order got rearranged after the --merge command - From I,II,III,IV,V,X,MtDNA to I,II,II,IV,MtDNA,V,X, I am assuming this happens because the output sorts the chromosomes by alpha-numeric?

Apart from that, this tool seems great! I'll let you know if I notice anything else.

Best,
S

@J35P312
Copy link
Owner

J35P312 commented Mar 15, 2018

Hello there!
That's a tricky issue, it usually happens if the start position of the SV is the same, but the end position differs greatly. I have an example here (cnvnator calls from two individuals):

1 24257001 CNVnator_del_34 N . PASS END=24363000;SVTYPE=DEL;SVLEN=-106000
1 24257001 CNVnator_del_18 N . PASS END=24319000;SVTYPE=DEL;SVLEN=-62000;

These two will not be merged, because their END position differs too much. You may avoid a few of these cases by changing the --overlap and --bnd_distance parameters, but there will always be a risk of getting a few of these.
Some --priority flag option sounds like a good idea, I will think about it!
In essence we would need some nice way of representing these SV as multiallelic sites.

I just made a fix to SVDB, if your vcf files have "contig entries" in the vcf-header, the chromosomes should be sorted correctly!
You are correct, otherwise SVDB will sort the chromosomes by lexiographic order.

Thanks! Happy to hear that the tool works for you! It would be great to know if you find any bugs or ideas of new features!
Regards
//Jesper

@dnil
Copy link
Contributor

dnil commented Nov 20, 2019

Bump on this one due to a recent MIP issue, although the use case was somewhat unexpected. I could perhaps phrase this as a question: should the docs perhaps state that SVDB merge require input to be split (and normalised)? I have a feeling it won't really handle many-way merging of multi-allelic input entries confidently?

@sjurug
Copy link

sjurug commented Apr 24, 2020

Hi, @J35P312

We also experience a similar problem

We observed that overlapping variants are not merged when trying to merge sequence resolved and non-sequence resolved variants from Truvari.

I think that this is what happens

  • Truvari removes the END info field for sequence resolved DEL variants
  • In svdb/readVCF.py, this leads to END being estimated as posB=posA+int(description["SVLEN"]), but for DEL case SVLEN is negative
  • Later, this leads to posA and posB being interchanged.
  • The new variant is no longer overlapping the original variant

Intuitively, I would change posB=posA+int(description["SVLEN"]) -> posB=posA+abs(int(description["SVLEN"])), but maybe you could have a look at it?
Maybe also add support for the cases when sequence resolved variants have neither END nor SVLEN?

@J35P312
Copy link
Owner

J35P312 commented Apr 24, 2020

Woopsiedaisy!
And thanks for your suggestions!

could you send me the variants you are trying to merge?
Then I will look in to this!

//JEsper

@sjurug
Copy link

sjurug commented Apr 24, 2020

Thank you for looking into this!

I have tried to make a minimal test set consisting of

  • Manta-style sequence resolved variant without END
  • Delly-style non-sequence resolved variant with END
  • Result from SVDB merge

I have been able to get the expected overlap by either

@J35P312
Copy link
Owner

J35P312 commented Apr 27, 2020

Hello!
I have had a look, thanks for the test data!
I notice you are correct, I have edited the readvcf function to this:

posB=posA+abs(int(description["SVLEN"]))

and now the variants merge nicely!

1 95690510 MantaDEL:manta_95690510|DEL00000389:delly_95690510 AAAATAAATATAATTGGCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGCGGGCGGATCACGAGGTCAGGAGATCGAGACCATCCCGGCTAAAAAACGGTGAAACCCCGTCTCTACTAAAAATACAAAAAATTAGCCGGGCGTAGTGGCGGGCGCCTGTAGTCCCAGCTACTCGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCGAGATCCCGCCACTGCACTCCAGCCTGGGCGACAGAGCGAGACTCCGTCTCAAAAAAAAAAAAAAAAATAAAT A994 PASS SVTYPE=DEL;SVLEN=-319;END=95690830;VARID=DEL00000389:delly_95690510;set=Intersection GT 1/1

Feel free to give this a try!
//JEsper

@sjurug
Copy link

sjurug commented Apr 28, 2020

Thank you!

@sjurug
Copy link

sjurug commented Aug 16, 2021

Hi again,

It seems that also INS is affected by a similar issue

  • two INS with different POS will not merge despite equal lengths
  • I believe that it is caused by SVDB defaulting to posB=END internally instead of defaulting to posB=POS+abs(SVLEN) in readVCF.py
  • If using posB=POS+abs(SVLEN) as default (and posB=END if there is no SVLEN), the variants will merge
  • I belive that the origin of this problem is that END is a coordinate in the reference, while SVLEN is measured in some other coordinate system. Internally in SVDB this distinction does not matter, but when exporting the database-VCF, we get END=POS+abs(SVLEN), which does not work for INS. Again this does not matter when using SVDB for querying, but even after the suggested fix it could potentially lead to some confusion for people investigating the VCF manually.

Attached are tree INSes that only merge if I use the suggested fix. I have tested this

  • merging the variants
  • making a VCF-database of two INSes
  • querying the third INS using the VCF-database

svdb_16_testdata_ins.tar.gz

@J35P312
Copy link
Owner

J35P312 commented Sep 1, 2021

Hello!
Sorry for the delay!
I have had a look, and implemented separate treatment of the INS variants. now PosB=PosA (i.e all ins will have lenght 0 in the reference); in essence, the insertions will be treated as single points. I also added a new parameter "--ins_distance" allowing the insertions to be clustered with a different distance than bnd_distance.

these changes are found in the branch "insertion_distance". Feel free to try it!

To me this solves the issue of END=POS+abs(SVLEN), insertions not being merged properly etc in a good enough way.
Ideally, we should merge based on sequence similarity, but this is tricky since different callers report insertions in such different formats.

according to these rules, all the insertions in the test data get merged.

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG002 HG003 HG004
1 95690510 MantaINS:test_ins1|MantaINS:test_ins2|MantaINS:test_ins3 A AAAATAAATATAATTGGCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGCGGGCGGATCACGAGGTCAGGAGATCGAGACCATCCCGGCTAAAAAACGGTGAAACCCCGTCTCTACTAAAAATACAAAAAATTAGCCGGGCGTAGTGGCGGGCGCCTGTAGTCCCAGCTACTCGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCGAGATCCCGCCACTGCACTCCAGCCTGGGCGACAGAGCGAGACTCCGTCTCAAAAAAAAAAAAAAAAATAAAT 994 PASS END=95690510;SVTYPE=INS;SVLEN=319;VARID=MantaINS:test_ins2|MantaINS:test_ins3;set=Intersection GT 1/1 1/1 1/1

Do you agree with these rules? Feel free to comment! I will do some tests on mobile element insertions etc, probably I will merge the insertion_distance by the end of the week!

@sjurug
Copy link

sjurug commented Sep 9, 2021

Thanks!
I think that this solution is a good first approximation. If we are really afraid of over-estimating the frequency, we can use a very small value such as --ins_distance=0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants