I got different results on InDel extraction by using vcftools and GATK-SelectVariants #7100

Open
woaishiye opened this issue Feb 24, 2021 · 1 comment


Affected tool(s) or class(es)

GATK
VCFTOOLS

Affected version(s)

GATK-v4.1.1.0
VCFTOOLS-v0.1.15

Description

I used vcftools (--remove-indels and --keep-only-indels) and GATK SelectVariants to extract SNPs and InDels from a VCF file produced by GATK GenotypeGVCFs, but the two tools gave different results. The SNP counts were the same, but the InDel counts differed. As I understand it, the original VCF file contains only two types of variants, SNPs and InDels, so the number of SNPs plus the number of InDels should equal the total number of variants in the original file. With vcftools, SNPs plus InDels do equal that total; with SelectVariants they do not. Does SelectVariants apply special rules when extracting InDels from a VCF file that would make its count smaller than expected? Any help will be appreciated. Below are two screenshots, from SelectVariants (left) and vcftools (right).
[image]
[image]

Steps to reproduce

java -Xmx3990m -Djava.io.tmpdir=./JavaTmpDir \
  -jar gatk-package-4.1.1.0-local.jar SelectVariants \
  -R ./a_a_ref/Gmax_275_v2.0.fa \
  -V original.vcf.gz \
  -select-type INDEL \
  -O original.InDel.vcf.gz

$VCFTOOLS --gzvcf original.vcf.gz \
  --keep-only-indels --out original.InDel \
  --recode --recode-INFO-all

Expected behavior

We should get the same number of InDels from vcftools and GATK SelectVariants.

Actual behavior

The number of InDels produced by GATK SelectVariants was smaller than the number produced by vcftools.
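
To pin down which sites account for the gap, one option is to compare the CHROM:POS lists of the two outputs. The following is a minimal sketch, not part of either tool: `vcf_sites` and `sites_only_in_first` are hypothetical helper names, and it assumes vcftools wrote its result to original.InDel.recode.vcf (its usual naming for --recode with --out original.InDel).

```shell
# Hypothetical helper: print CHROM:POS for every record of a VCF file,
# which may be plain text or gzip-compressed (judged by the .gz suffix).
vcf_sites() {
  case "$1" in
    *.gz) gzip -dc "$1" ;;
    *)    cat "$1" ;;
  esac | grep -v '^#' | cut -f1,2 | tr '\t' ':' | LC_ALL=C sort -u
}

# Hypothetical helper: list sites present in the first file but not the second.
sites_only_in_first() {
  a=$(mktemp) && b=$(mktemp) || return 1
  vcf_sites "$1" > "$a"
  vcf_sites "$2" > "$b"
  LC_ALL=C comm -23 "$a" "$b"     # column 1 only: lines unique to the first list
  rm -f "$a" "$b"
}
```

Under those assumptions, `sites_only_in_first original.InDel.recode.vcf original.InDel.vcf.gz` would print the sites that vcftools kept as InDels and SelectVariants did not.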

kachulis (Contributor) commented Mar 1, 2021

@woaishiye Some of the sites that vcftools outputs as indels but SelectVariants does not are excluded because SelectVariants considers them mixed sites, containing both SNPs and INDELs (chr1:37, for example). Other sites, however, such as chr1:61, seem to highlight an issue with how the spanning deletion allele (*) is treated in htsjdk.
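
To get a rough picture of how many records fall into each of these groups, the sketch below labels every record from its REF and ALT columns alone. It only approximates the idea of a per-site type and is not htsjdk's actual logic: `classify_sites` is a hypothetical helper, and reporting `*` as a separate `+STAR` suffix rather than as an allele type is an assumption made here for illustration.

```shell
# Hypothetical helper: read a VCF on stdin and print "CHROM:POS<TAB>label".
# Assumption: types are judged purely by comparing REF and ALT lengths.
classify_sites() {
  awk -F'\t' '
    /^#/ { next }                            # skip header lines
    {
      snp = indel = mnp = sym = star = 0
      n = split($5, alt, ",")
      for (i = 1; i <= n; i++) {
        a = alt[i]
        if (a == ".") continue               # no ALT allele
        if (a == "*") star = 1               # spanning deletion
        else if (a ~ /^</) sym = 1           # symbolic allele such as <NON_REF>
        else if (length(a) != length($4)) indel = 1
        else if (length(a) == 1) snp = 1
        else mnp = 1
      }
      kinds = snp + indel + mnp + sym
      if (kinds > 1)  label = "MIXED"
      else if (snp)   label = "SNP"
      else if (indel) label = "INDEL"
      else if (mnp)   label = "MNP"
      else if (sym)   label = "SYMBOLIC"
      else            label = "NO_ALT"
      if (star) label = label "+STAR"
      print $1 ":" $2 "\t" label
    }'
}
```

Running `gzip -dc original.vcf.gz | classify_sites | cut -f2 | sort | uniq -c` would then give a count per label. Under these assumptions, records labelled MIXED or carrying +STAR are the ones most likely to be counted differently by the two tools.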
