Split out multi-allelic indels that start at different positions into separate records #2917

Closed
droazen opened this issue Jun 5, 2017 · 1 comment
droazen commented Jun 5, 2017

@vdauwera commented on Tue Nov 15 2016

@vdauwera commented on Sat Mar 07 2015

Originally from @eitanbanks

For example, we should split:

pos=N ref=AA alt=A,AAA

into:

pos=N ref=AA alt=A
pos=N+1 ref=A alt=AA

If we don't do this then we're going to run into problems with GenotypeGVCFs. Imagine if sample1 is multi-allelic and has the original record above at position N; and sample2 is bi-allelic and only has the insertion (so its record would be at position N+1). Because GenotypeGVCFs runs over each position in the gVCF/NAVS, it will genotype the same insertion separately for the 2 samples (because they occur in records at different positions).
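For illustration only, here is a minimal Python sketch of the split described above. It is not GATK code: `split_multiallelic` and the `(pos, ref, alt)` tuple layout are hypothetical names, and the trimming rule (drop shared leading bases, keep one anchor base) is an assumption chosen because it reproduces the example in this comment. It does not left-align or trim shared trailing bases.

```python
from typing import List, Tuple

# Hypothetical record layout for this sketch: (pos, ref, alt)
Record = Tuple[int, str, str]


def split_multiallelic(pos: int, ref: str, alts: List[str]) -> List[Record]:
    """Split one multi-allelic record into one bi-allelic record per ALT."""
    records = []
    for alt in alts:
        p, r, a = pos, ref, alt
        # Drop shared leading bases, keeping one anchor base in each allele.
        # Each dropped base moves the record start one position to the right.
        while len(r) > 1 and len(a) > 1 and r[0] == a[0]:
            r, a, p = r[1:], a[1:], p + 1
        records.append((p, r, a))
    return records


N = 100  # arbitrary position, for illustration only
print(split_multiallelic(N, "AA", ["A", "AAA"]))
# prints [(100, 'AA', 'A'), (101, 'A', 'AA')]
```

With N = 100, the deletion stays at N and the insertion moves to N+1, matching the two records listed above.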


@vdauwera commented on Thu Jul 16 2015

May be solved by the spanning deletion fix. @eitanbanks do you still want methods to look at this? They need a concrete example.


@eitanbanks commented on Fri Jul 17 2015

This is not solved by the spanning deletions fix. Do you want me to create two sample gVCFs that illustrate this problem?


@ldgauthier commented on Fri Jul 31 2015

Here's the example Eric came up with when we discussed this:
[Image: cam00218]
(I have no idea why Github rotated this)
If there is a het-non-ref sample (like S1), its alleles can be represented differently in the gVCF than a sample with a bi-allelic variant. Then when they get genotyped together, the same allele can show up at two different positions in the combined VCF, i.e. the T insertion is listed at position 325 for S1 and 326 for S2, but it's the same variant. This is probably what happened in the ExAC example (#1072).
It would be great for someone to write a HC (gVCF mode) unit test for this with some artificial reads so we can start working on a splitting procedure.
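As a rough illustration of "same variant, two positions", the sketch below applies two differently anchored records to a reference string and compares the resulting haplotypes. The whiteboard example is not reproduced here; the reference `CAAG`, the 1-based positions, and the helper `apply_variant` are all made up for this sketch, and the alleles are the AA/AAA ones from the opening comment.

```python
def apply_variant(reference: str, pos: int, ref: str, alt: str) -> str:
    """Return the haplotype obtained by substituting ALT for REF at 1-based pos."""
    start = pos - 1
    if reference[start:start + len(ref)] != ref:
        raise ValueError("REF allele does not match the reference at this position")
    return reference[:start] + alt + reference[start + len(ref):]


reference = "CAAG"  # hypothetical reference, for illustration only

# Insertion as it appears inside the multi-allelic record (anchored at pos 2)
from_multi = apply_variant(reference, 2, "AA", "AAA")
# The same insertion as a stand-alone bi-allelic record (anchored at pos 3)
from_bi = apply_variant(reference, 3, "A", "AA")

print(from_multi, from_bi, from_multi == from_bi)
# prints CAAAG CAAAG True
```

Both records yield the same haplotype, so a tool that keys on position alone would treat one allele as two different events.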


@vdauwera commented on Mon Nov 14 2016

Does anyone still care about this? If so, should it go into the GATK4 repo?


@ldgauthier commented on Tue Nov 15 2016

I care, I just don't have the bandwidth to work on it. Please move to GATK4.

@ldgauthier

After looking at this again (years later), I don't think this is a real thing. Whether they occur in the same sample or not, variants get represented based on the SW alignment of the assembled haplotype to the reference haplotype. The S1 GGAGTC allele will get aligned to the reference as a G->GT insertion. When it's genotyped against the reads from the other haplotype it probably used to exhibit different behavior because HC didn't have spanning deletions, but thanks to Chris W. now it does! (#4963) I'm closing this since we don't have a real test case and I don't believe our whiteboard scribbles anymore.
