Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

INFO field 'END' is before varaint's 'POS' for variant created via Smoove SV population calling #56

Open
WimSpee opened this issue Feb 5, 2019 · 11 comments

Comments

@WimSpee
Copy link

WimSpee commented Feb 5, 2019

Hi,

I was able to use the Smoove population SV calling pipeline to SV call 200+ samples.
SVs that we are interested in were also included in the results.
So the pipeline worked fast and the results are good based on our first experience with Smoove.

I ran into one issue while trying to annotate the SV's via SnpEff.

java.lang.RuntimeException: java.lang.RuntimeException: INFO field 'END' is before varaint's 'POS'
        END : 3869
        POS : 3923
        at org.snpeff.fileIterator.VcfFileIterator.parseVcfLine(VcfFileIterator.java:115)
        at org.snpeff.fileIterator.VcfFileIterator.readNext(VcfFileIterator.java:166)
        at org.snpeff.fileIterator.VcfFileIterator.readNext(VcfFileIterator.java:56)
        at org.snpeff.fileIterator.FileIterator.hasNext(FileIterator.java:123)
        at org.snpeff.snpEffect.commandLine.SnpEffCmdEff.annotateVcf(SnpEffCmdEff.java:449)
        at org.snpeff.snpEffect.commandLine.SnpEffCmdEff.annotate(SnpEffCmdEff.java:140)
        at org.snpeff.snpEffect.commandLine.SnpEffCmdEff.run(SnpEffCmdEff.java:992)
        at org.snpeff.snpEffect.commandLine.SnpEffCmdEff.run(SnpEffCmdEff.java:947)
        at org.snpeff.SnpEff.run(SnpEff.java:1009)
        at org.snpeff.SnpEff.main(SnpEff.java:155)
Caused by: java.lang.RuntimeException: INFO field 'END' is before varaint's 'POS'
        END : 3869
        POS : 3923
        at org.snpeff.vcf.VcfEntry.parseEnd(VcfEntry.java:1092)
        at org.snpeff.vcf.VcfEntry.parse(VcfEntry.java:961)
        at org.snpeff.vcf.VcfEntry.<init>(VcfEntry.java:220)
        at org.snpeff.fileIterator.VcfFileIterator.parseVcfLine(VcfFileIterator.java:112)

https://github.com/pcingola/SnpEff/blob/87976d2c63fc2590408cc40f41b96818079324b9/src/main/java/org/snpeff/vcf/VcfEntry.java#1159

I could trace this back to this record in the final VCF file produced via Smoove paste. In this record END is indeed before POS.

Chr_01  3924 92939   N       <DEL>   29.87   .       SVTYPE=DEL;SVLEN=54;END=3870;STRANDS=+-:31;IMPRECISE;CIPOS=-182,109;CIEND=-120,194;CIPOS95=-157,109;CIEND95=-120,153;SU=31;PE=31;SR=0;SNAME=Sample_46:149,Sample_79:52;ALG=SUM;AN=462;AC=1

For Sample_79 I could find back this records in the result from smoove call. This variants has the same start position, but length 48 instead of 54 and END is after POS.

Chr_01  3924 52      N       <DEL>   89.31   .       SVTYPE=DEL;SVLEN=-48;END=3972;STRANDS=+-:11;IMPRECISE;CIPOS=-30,105;CIEND=-47,88;CIPOS95=0,104;CIEND=-47,88;CIPOS95=0,104;CIEND95=-45,59;SU=11;PE=11;SR=0

In the merged.sites.vcf.gz VCF file produced with smoove the END before POS issue can first be found:

Chr_01  3924 92939   N       <DEL>   0       .       SVTYPE=DEL;SVLEN=54;END=3870;STRANDS=+-:31;IMPRECISE;CIPOS=-182,109;CIEND=-120,194;CIPOS95=-157,109;CIEND=-120,194;CIPOS95=-157,109;CIEND95=-120,153;SU=31;PE=31;SR=0

I could not find in the VCF spec if END is always required to be after POS.
https://samtools.github.io/hts-specs/VCFv4.2.pdf

SnpEff seems to take this strict. As far as I know at least (small) variants including indels are always reported on the positive strand with END after than POS. Do you know if this is/should also be the case for SV's?

And maybe check if this issue can be fixed in the smoove merge code.

Thank you.

@WimSpee
Copy link
Author

WimSpee commented Feb 5, 2019

his is issue is based on files created with Smoove version 0.1.9.
I still need to install Smoove version 0.2.3.

@brentp
Copy link
Owner

brentp commented Feb 5, 2019

yes, this should be fixed in 0.2.3

@WimSpee
Copy link
Author

WimSpee commented Feb 5, 2019

Ok thank you.

@WimSpee
Copy link
Author

WimSpee commented Feb 12, 2019

Hi Brent. I upgraded to Smoove 0.2.3 and ran another set of samples. I still get multiple 'END' is before variants 'POS' errors (I run the annotation per chromosomes, some chromosomes finish without this error, some have this error. So the bug is I guess caused by a somewhat rare edge condition )

See also the error below and the Smoove version from VCF header.

##smoove_version=0.2.3

VcfFileIterator.parseVcfLine(114):      Fatal error reading file '-' (line: 5491):
Chr_XX  41741        16387   N       <DEL>   44.83   .       SVTYPE=DEL;SVLEN=26;END=41715;STRANDS=+-:10;IMPRECISE;CIPOS=-77,20;CIEND=-20,77;CIPOS95=-31,20;CIEND95=-20,55;SU=10;PE=10;SR=0;SNAME=SXS_101913_26:1866,SXS_101701_05:1762;ALG=SUM;GCF=0;AN=162;AC=14

java.lang.RuntimeException: java.lang.RuntimeException: INFO field 'END' is before varaint's 'POS'
        END : 41714
        POS : 41740
        at org.snpeff.fileIterator.VcfFileIterator.parseVcfLine(VcfFileIterator.java:115)
        at org.snpeff.fileIterator.VcfFileIterator.readNext(VcfFileIterator.java:166)
        at org.snpeff.fileIterator.VcfFileIterator.readNext(VcfFileIterator.java:56)
        at org.snpeff.fileIterator.FileIterator.hasNext(FileIterator.java:123)
        at org.snpeff.snpEffect.commandLine.SnpEffCmdEff.annotateVcf(SnpEffCmdEff.java:449)
        at org.snpeff.snpEffect.commandLine.SnpEffCmdEff.annotate(SnpEffCmdEff.java:140)
        at org.snpeff.snpEffect.commandLine.SnpEffCmdEff.run(SnpEffCmdEff.java:992)
        at org.snpeff.snpEffect.commandLine.SnpEffCmdEff.run(SnpEffCmdEff.java:947)
        at org.snpeff.SnpEff.run(SnpEff.java:1009)
        at org.snpeff.SnpEff.main(SnpEff.java:155)
Caused by: java.lang.RuntimeException: INFO field 'END' is before varaint's 'POS'

Number of samples and variants via bcftools stats

SN      0       number of samples:      204
SN      0       number of records:      50929

@brentp
Copy link
Owner

brentp commented Feb 12, 2019

thanks for reporting, I'll have a look.

@WimSpee
Copy link
Author

WimSpee commented Feb 12, 2019

Thank you.

brentp added a commit to brentp/svtools that referenced this issue Feb 12, 2019
@brentp
Copy link
Owner

brentp commented Feb 12, 2019

I opened a PR to fix this in svtools. Should be able to get that in the next release.
What are you using to set the REF allele? I had meant to add that to smoove as well.

@WimSpee
Copy link
Author

WimSpee commented Feb 12, 2019

Hi Brent. Thank you for opening the PR at svtools. I am running vanilla Smoove 0.2.3, following exactly the Smoove commands under the population section of the documentation on this github page.

I am not using anything to set the REF allele. The REF allele just is N in the example above and also elsewhere in the VCF that Smoove made on our data. Or I am missing something / understanding it not fully.

@brentp
Copy link
Owner

brentp commented Feb 12, 2019

got it. I was thinking that snpEff wouldn't work without having the REF allele set properly but apparently that's not (any longer?) the case.

@WimSpee
Copy link
Author

WimSpee commented Feb 12, 2019

snpEff worked for us in the past to annotate SV VCF files made with Lumpy and Manta.
The documentation also says SVs are somewhat supported.

Variants supported 	SnpEff can annotate SNPs, MNPs, insertions and deletions. Support for mixed variants and structural variants is available (although sometimes limited). 

http://snpeff.sourceforge.net/SnpEff_manual.html

I do post process the snpEff annotated SV VCF file with a small CYVCF2 script to collapse very large ANN fields in to just the set or the number of genes hit (otherwise the ANN field becomes way to big to use /make sense of for large SVs).

@WimSpee
Copy link
Author

WimSpee commented Aug 26, 2019

Thanks for pinging at svtools. As a temporary workaround we implemented the following small cyvcf2 script to post process the Smoove results. Maybe this is useful for someone else until the svtools update is made.

smoove_bcf_path = "/data/some_path/DA_1355.smoove.square.bcf"

from cyvcf2 import VCF
vcf_reader = VCF(smoove_bcf_path)
print(vcf_reader.raw_header, end='')

for variant in vcf_reader:    
    pos = variant.POS
    end = variant.INFO.get('END')
    end_cyvcf2 = variant.end

    if not end:
        print(variant, end='')
    else:            
        if end < pos:
            str_variant = str(variant)
            new_str_variant = str_variant.replace(variant.CHROM+"\t"+str(variant.POS), variant.CHROM+"\t"+str(variant.INFO.get('END')))
            new_str_variant2 = new_str_variant.replace("END="+str(variant.INFO.get('END')), "END="+str(variant.POS))
            print(new_str_variant2, end='')            
        else:
            print(variant, end='')        

        if end != end_cyvcf2:
            print("end %s and cyvcf2 end  %sare not the same" %  (str(end), str(end_cyvcf2)))
            exit()

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants