Large deletions + VCF Evidence #12

Merged
merged 16 commits into from
Apr 13, 2023

Conversation

JeremyWesthead
Collaborator

Refined the logic for gene-level deletions: if a deletion spans more than one gene, it is now also detected within the downstream gene.
Added the percentage of the gene deleted, in cases where a deletion removes >= 50% of the gene.
Added VCF evidence for each mutation (at both genome and gene level).
Several bug fixes, such as ensuring all TB genes are valid gene names according to the Gene.valid_variant method.
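
As a rough illustration of the first two points, here is a minimal sketch of how a deletion can be attributed to every gene it overlaps, along with the fraction of each gene removed. `GeneSpan`, `deleted_fractions` and `large_deletions` are hypothetical names used only for this example; they are not the classes or functions in this branch, and the 1-based inclusive coordinates are an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneSpan:
    # Hypothetical container for illustration, not the library's Gene class.
    name: str
    start: int  # 1-based genome index, inclusive (assumed convention)
    end: int    # 1-based genome index, inclusive (assumed convention)


def deleted_fractions(del_start, del_length, genes):
    """Map each gene the deletion touches to the fraction of that gene removed."""
    del_end = del_start + del_length - 1
    fractions = {}
    for gene in genes:
        # Number of bases shared by the deletion and the gene (<= 0 means no overlap).
        overlap = min(del_end, gene.end) - max(del_start, gene.start) + 1
        if overlap > 0:
            fractions[gene.name] = overlap / (gene.end - gene.start + 1)
    return fractions


def large_deletions(fractions, threshold=0.5):
    """Keep only genes where at least `threshold` of the gene is deleted."""
    return {name: frac for name, frac in fractions.items() if frac >= threshold}


genes = [GeneSpan("geneA", 100, 199), GeneSpan("geneB", 250, 349)]

# A 130-base deletion starting at 150 runs into the downstream gene as well.
fractions = deleted_fractions(150, 130, genes)
print(fractions)                   # {'geneA': 0.5, 'geneB': 0.3}
print(large_deletions(fractions))  # {'geneA': 0.5}
```

In this sketch the deletion is recorded against both genes, but only geneA reaches the >= 50% threshold for a percentage to be reported.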

@philipwfowler
Member

Have tried out the branch. A few thoughts before I merge, as it probably makes more sense to correct these in the branch first:

  • We need to decide what to do about minor populations vs. het calls. The former is, e.g., a MIN_FRS fail with a 0/0 call despite good evidence of a minor population; the latter is when the variant caller chooses 0/1 because there is good evidence of a minor population. What seems to happen at the moment is that if there is a het (0/1) call but the position is listed in minor_population_indices, then the variant is reported with a z. If it is listed, there is also an entry in diff.minor_populations(). The problems come with a het call that doesn't include the wild type: e.g. for 1/2 the minor populations are reported as two entries in the list w.r.t. the wild type, which makes sense, but I'd have to notice and catch that downstream? The fix could be as simple as halting, or not noting the minor population, in these cases? As we are downstream of Clockwork, which isn't making any het calls, this behaviour shouldn't be exposed, but I think we should protect against it for other uses.
  • If you have (say) a long deletion, the sample.vcf_evidences list contains one entry per reference base that is deleted, rather than a single entry at the 'start' of the deletion. The Genome difference object gets it right and has only a single entry, which makes sense since we'd only have one row in the equivalent of the VARIANTS table for the deletion. Is this deliberate / unavoidable?
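
To make both points concrete, here is a minimal sketch of the kind of guard and clean-up I have in mind. `classify_call` and `collapse_evidence` are hypothetical helpers, not functions in the branch, and the shape of the evidence data (a dict from genome index to a record identifier) is an assumption made purely for illustration.

```python
def classify_call(genotype):
    """Classify a VCF GT value given as a tuple of allele indices, e.g. (0, 1).

    None stands for a missing allele ('.' in the VCF).
    """
    if any(allele is None for allele in genotype):
        return "null"
    alleles = set(genotype)
    if alleles == {0}:
        return "ref"
    if len(alleles) == 1:
        return "hom_alt"
    # More than one distinct allele, so this is a het call; the awkward
    # case is the one where the wild type (allele 0) is absent, e.g. 1/2.
    return "het_with_ref" if 0 in alleles else "het_without_ref"


def collapse_evidence(per_base):
    """Keep one evidence entry per VCF record, at its lowest genome index.

    `per_base` maps genome index -> record identifier (assumed layout).
    """
    first_index = {}
    for index, record_id in sorted(per_base.items()):
        # setdefault only stores the first (lowest) index seen for a record.
        first_index.setdefault(record_id, index)
    return {index: record_id for record_id, index in first_index.items()}


print(classify_call((0, 1)))  # het_with_ref
print(classify_call((1, 2)))  # het_without_ref

# A 3-base deletion from one record, plus an unrelated record.
per_base = {100: "rec1", 101: "rec1", 102: "rec1", 200: "rec2"}
print(collapse_evidence(per_base))  # {100: 'rec1', 200: 'rec2'}
```

With something like this, a "het_without_ref" result could either halt or skip noting the minor population, and the evidence would line up with the single row we'd expect per deletion.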

@philipwfowler philipwfowler merged commit 189b6eb into master Apr 13, 2023