CNV VCFs could be improved #6167
Comments
Thanks for the suggestions @tfenne. I think we have considered 2 in the past, but just haven’t gotten around to implementing it yet. As for 4, output of the denoised copy ratios was recently added, but users may have to do a bit of work to grab those that overlap a particular interval. As you allude to, conventions for CNV VCFs are not as settled, so we are more than happy to take suggestions like these from the community.
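The "bit of work" mentioned above could look roughly like the sketch below, which pulls the rows of a tab-separated copy-ratio table that overlap a query interval. The column names (CONTIG, START, END, LINEAR_COPY_RATIO), the "@"-prefixed header lines, and the example values are assumptions for illustration; check them against the actual file before relying on this.

```python
import csv

def rows_overlapping(tsv_text, contig, start, end):
    """Return rows whose [START, END] interval overlaps [start, end] (1-based, inclusive)."""
    # Assumption: SAM-style header lines start with '@' and precede the column header.
    lines = [ln for ln in tsv_text.splitlines() if ln and not ln.startswith("@")]
    reader = csv.DictReader(lines, delimiter="\t")
    hits = []
    for row in reader:
        if row["CONTIG"] != contig:
            continue
        # Two closed intervals overlap when each one starts before the other ends.
        if int(row["START"]) <= end and int(row["END"]) >= start:
            hits.append(row)
    return hits

# Made-up example table; the values are not from any real sample.
example_tsv = (
    "@HD\tVN:1.6\n"
    "CONTIG\tSTART\tEND\tLINEAR_COPY_RATIO\n"
    "chr1\t100\t200\t1.98\n"
    "chr1\t300\t400\t1.02\n"
    "chr2\t100\t200\t2.01\n"
)

overlapping = rows_overlapping(example_tsv, "chr1", 150, 350)
```

Here `overlapping` holds the two chr1 rows, since both intervals touch the 150-350 query.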
For manual review, @asmirnov239 was working on normalized coverage plots, right?
@tfenne I've been putting in some work cleaning up the gCNV VCFs and I have fixes for your requests 1-3. For 4, do you know what you'd like in the VCF? As you know, most of the events are going to have multiple targets, and each target has different coverage, so are you thinking mean? Median? I was also contemplating putting more info in the intervals VCF since things like target coverage make more sense at the interval level.
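To make the mean-versus-median question concrete, here is a small sketch with invented per-target values for a single multi-target event. The numbers and the `summarize_event` helper are hypothetical; the point is only that one outlying target shifts the mean noticeably while the median stays put.

```python
from statistics import mean, median

# Hypothetical normalized coverage for the five targets spanned by one event.
# The fourth target is an outlier compared with the rest.
target_coverage = [0.48, 0.52, 0.50, 0.95, 0.49]

def summarize_event(values):
    """Summarize per-target coverage for one event with both candidate statistics."""
    return {"mean": round(mean(values), 3), "median": round(median(values), 3)}

summary = summarize_event(target_coverage)
```

With these values the median reflects the four concordant targets, while the mean is pulled upward by the single outlier.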
Dear all, I am a bit confused why GATK uses As it is right now, if I read a VCF from GATK CNV germline pipeline through
Any reason to not use the standard I have also noticed that GATK outputs some non-variable SVs to the VCF without any ALT allele. Why not remove them if they are actually not SVs, if thanks,
@fgvieira would diploid no-call genotypes work in your pipeline? The example in the VCF spec (http://samtools.github.io/hts-specs/VCFv4.2.pdf page 11) has no-call diploid genotypes, GQ=0 with copy number (CN) and copy number quality (CNQ) specified. These aren't allelic calls so we can't say how many copies are on each haplotype. I think the haploid genotype calls were supposed to be for convenience. @cwhelan what do we do (or should we do) for depth-only CNV calls in the WGS SV pipeline? We output CN2 calls so that they can be interpreted in the context of a larger cohort to calculate site frequency or to assess carrier status for family studies. Right now the "joint calling" for exome copy number is under active development, but we leave the reference calls for ad hoc joint analysis.
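A downstream consumer of the no-call-plus-CN convention described above might handle it along these lines. This is a minimal sketch: the sample values are illustrative, and the helper names and the assumed reference ploidy of 2 are not taken from GATK.

```python
def parse_sample(format_str, sample_str):
    """Pair FORMAT keys with the sample's colon-separated values."""
    return dict(zip(format_str.split(":"), sample_str.split(":")))

def classify_by_copy_number(fields, ref_ploidy=2):
    """Classify a depth-only call from CN alone, ignoring the (no-call) GT."""
    cn = int(fields["CN"])
    if cn < ref_ploidy:
        return "DEL"
    if cn > ref_ploidy:
        return "DUP"
    return "REF"  # e.g. a CN2 call kept for cohort-level analysis

# Illustrative sample column: no-call diploid GT, GQ=0, copy number 3.
call = parse_sample("GT:GQ:CN:CNQ", "./.:0:3:16.2")
event_type = classify_by_copy_number(call)
```

Because the genotype is a no-call, the event type comes entirely from CN relative to the expected ploidy, which would need to change on sex chromosomes.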
In the WGS SV pipeline, for deletions and duplications that the pipeline believes to be biallelic we do the following:
We currently report depth based copy number and quality for these variants in custom format fields
For multiallelic CNVs (i.e. sites where our model is not sure that the variant is bi-allelic) we write:
I think there are some tradeoffs in completely characterizing the evidence for and quality of each call and enabling easy searching across the whole VCF without having to parse and understand the entire record. Older versions of our pipeline used to put the diploid copy number of the event into the GT field, I think similarly to what's being described above. This is incorrect VCF -- GT values should be indices into the allele list for the variant, and should be a list of length equal to the ploidy. My view is that if you can confidently infer the alleles present at the site in the sample set you should use a GT value of the form
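The rule stated above (GT entries are indices into the allele list, and there is one entry per chromosome copy) can be checked mechanically. The function below is a hypothetical sketch of such a check, not part of any GATK tool.

```python
import re

def gt_is_valid(gt, n_alts, ploidy):
    """Check that a GT string has `ploidy` entries, each '.' or an allele index.

    Index 0 is REF and indices 1..n_alts are the ALT alleles of the record.
    """
    alleles = re.split(r"[/|]", gt)  # accept unphased '/' and phased '|'
    if len(alleles) != ploidy:
        return False
    for allele in alleles:
        if allele == ".":
            continue  # no-call entries are allowed
        if not re.fullmatch(r"[0-9]+", allele) or int(allele) > n_alts:
            return False
    return True
```

For a record with a single ALT, `gt_is_valid("0/1", 1, 2)` passes, whereas writing a copy number such as `3` into GT fails both the length check for a diploid site and the index check.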
I like @cwhelan's suggestions, where we'd have:
We should apply the changes made to the segments VCF in #6352 to the intervals VCF to keep them consistent. See https://gatk.broadinstitute.org/hc/en-us/community/posts/360071515912-GATK4-1-6-0-gCNV-inconsistent-CNV-calls-in-intervals-and-segments-vcfs
Stumbled back on this when looking at #6924. @ldgauthier @mwalker174 was the above comment addressed? Might be good to verify the gCNV tutorial is still consistent or update it at some point.
Feature request
Tool(s) or class(es) involved
GermlineCNVCaller / PostprocessGermlineCNVCalls
Description
The VCF produced by the germline CNV calling workflow could be nicer. TBH the VCF output feels like a bit of an afterthought compared to the other outputs. This seems common for CNV callers, but I was hoping the VCF produced by the GATK would be more complete.
Things that would make the VCF easier to use/interpret with downstream tooling:
1. There are no ##contig lines in the header. PostprocessGermlineCNVCalls takes in a sequence dictionary, and I was surprised that isn't used in generating the VCF.
2. Records list both <DUP> and <DEL> as alts even though the VCF is single-sample and a given sample can only be duplicated or deleted. This makes quick text-searching of the VCF difficult and means one has to parse the genotypes to determine if the record represents a duplication or deletion in the sample.
3. The QUAL field is . for all events. There are various quality scores in the FORMAT/GENOTYPE fields. It would be nice if either the preferred one of those or some other quality measure could be emitted into the QUAL field.
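As a rough illustration of the first request, the sketch below turns the @SQ records of a SAM-style sequence dictionary into ##contig header lines. It is a standalone example and not how GATK builds its headers; the function name and the two-contig dictionary are made up for the demonstration.

```python
def contig_header_lines(dict_text):
    """Build VCF ##contig lines from the @SQ records of a sequence dictionary."""
    lines = []
    for line in dict_text.splitlines():
        if not line.startswith("@SQ"):
            continue  # skip @HD and any other header record types
        # Each field after the record type looks like TAG:VALUE, e.g. SN:chr1.
        tags = dict(field.split(":", 1) for field in line.split("\t")[1:])
        lines.append(f"##contig=<ID={tags['SN']},length={tags['LN']}>")
    return lines

# Small made-up dictionary with two contigs.
example_dict = (
    "@HD\tVN:1.6\n"
    "@SQ\tSN:chr1\tLN:248956422\n"
    "@SQ\tSN:chr2\tLN:242193529\n"
)

header_lines = contig_header_lines(example_dict)
```

With contig lines present, downstream tools can validate contig names and sort order without needing the reference alongside the VCF.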