Split out multi-allelic indels that start at different positions into separate records #2917

droazen · 2017-06-05T18:04:17Z

@vdauwera commented on Tue Nov 15 2016

@vdauwera commented on Sat Mar 07 2015

For example, we should split:

pos=N ref=AA alt=A,AAA

into:

pos=N ref=AA alt=A
pos=N+1 ref=A alt=AA
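
A minimal sketch of the split described above, not GATK code. The function name `split_multiallelic` and the tuple layout are hypothetical. It assumes each alt allele is reduced independently by trimming shared leading bases first (keeping one anchor base) and then shared trailing bases, which reproduces the positions in this example. Standard left-alignment normalization may place the insertion differently in a repeat.

```python
from typing import List, Tuple


def split_multiallelic(pos: int, ref: str, alts: List[str]) -> List[Tuple[int, str, str]]:
    """Split one multi-allelic record into one trimmed record per alt allele.

    Hypothetical helper for illustration only; returns (pos, ref, alt) tuples.
    """
    records = []
    for alt in alts:
        p, r, a = pos, ref, alt
        # Drop shared leading bases, keeping at least one base in each allele.
        # Each dropped base moves the record start one position to the right.
        while len(r) > 1 and len(a) > 1 and r[0] == a[0]:
            r, a, p = r[1:], a[1:], p + 1
        # Drop shared trailing bases, again keeping at least one base.
        while len(r) > 1 and len(a) > 1 and r[-1] == a[-1]:
            r, a = r[:-1], a[:-1]
        records.append((p, r, a))
    return records


# Taking N = 100 for the record from this issue:
print(split_multiallelic(100, "AA", ["A", "AAA"]))
# [(100, 'AA', 'A'), (101, 'A', 'AA')]
```

The deletion stays at N because its alt allele is already a single base, while the insertion moves to N+1 once the shared leading `A` is removed.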

If we don't do this, we're going to run into problems with GenotypeGVCFs. Imagine that sample1 is multi-allelic and has the original record above at position N, while sample2 is bi-allelic and only has the insertion (so its record would be at position N+1). Because GenotypeGVCFs runs over each position in the gVCF/NAVS, it will genotype the same insertion separately for the two samples, since the insertion occurs in records at different positions.
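
To make the concern concrete, here is a toy model of a per-position merge. It is not how GenotypeGVCFs is implemented; the record tuples, the sample names, and the choice of N = 100 are all assumptions for illustration. It only shows that keying records by start position puts the same insertion at two different sites.

```python
from collections import defaultdict

# Hypothetical toy records: (sample, pos, ref, alts), with N taken to be 100.
records = [
    ("sample1", 100, "AA", ["A", "AAA"]),  # multi-allelic: deletion plus insertion
    ("sample2", 101, "A", ["AA"]),         # bi-allelic: the insertion only
]

# Group records by start position, as a position-by-position traversal would.
by_pos = defaultdict(list)
for sample, pos, ref, alts in records:
    by_pos[pos].append((sample, ref, alts))

# The insertion of one A ends up at two separate sites, one per sample.
for pos in sorted(by_pos):
    print(pos, by_pos[pos])
```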

@vdauwera commented on Thu Jul 16 2015

May be solved by the spanning deletion fix. @eitanbanks do you still want methods to look at this? They need a concrete example.

@eitanbanks commented on Fri Jul 17 2015

This is not solved by the spanning deletions fix. Do you want me to create two sample gVCFs that illustrate this problem?

@ldgauthier commented on Fri Jul 31 2015

Here's the example Eric came up with when we discussed this:

(I have no idea why Github rotated this)
If there is a het-non-ref sample (like S1), its alleles can be represented differently in the gVCF than a sample with a bi-allelic variant. Then when they get genotyped together, the same allele can show up at two different positions in the combined VCF, i.e. the T insertion is listed at position 325 for S1 and 326 for S2, but it's the same variant. This is probably what happened in the ExAC example (#1072).
It would be great for someone to write a HC (gVCF mode) unit test for this with some artificial reads so we can start working on a splitting procedure.

@vdauwera commented on Mon Nov 14 2016

Does anyone still care about this? If so, should it go into the GATK4 repo?

@ldgauthier commented on Tue Nov 15 2016

I care, I just don't have the bandwidth to work on it. Please move to GATK4.

ldgauthier · 2018-10-05T14:50:28Z

After looking at this again (years later), I don't think this is a real thing. Whether they occur in the same sample or not, variants get represented based on the SW alignment of the assembled haplotype to the reference haplotype. The S1 GGAGTC allele will get aligned to the reference as a G->GT insertion. When it's genotyped against the reads from the other haplotype it probably used to exhibit different behavior because HC didn't have spanning deletions, but thanks to Chris W. now it does! (#4963) I'm closing this since we don't have a real test case and I don't believe our whiteboard scribbles anymore.

droazen mentioned this issue Jun 5, 2017

Split out multi-allelic indels that start at different positions into separate records broadinstitute/gatk-protected#787

Closed

droazen assigned ldgauthier Jun 5, 2017

ldgauthier closed this as completed Oct 5, 2018
