Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update consequence severity filtering #326

Merged
merged 8 commits into from
May 13, 2022

Conversation

apriltuesday
Copy link
Contributor

@apriltuesday apriltuesday commented May 9, 2022

Closes #321

  • Update strategy for filtering consequences based on severity
    • Genes overlapping variant are all reported, using the most severe consequence per gene
    • If no genes overlap, use overall most severe consequence and report all genes with that consequence (previous behaviour)
  • Refactor to remove unused code (distant VEP querying) and dataframe columns for consequence prediction

To get a sense of the additional consequences the new strategy gives us, you can look at this commit which shows the diff in the VEP test file for SNPs, as well as some of the other tests.

@apriltuesday apriltuesday marked this pull request as ready for review May 9, 2022 11:06
@apriltuesday apriltuesday self-assigned this May 9, 2022
@apriltuesday apriltuesday requested review from M-casado and tcezard May 9, 2022 11:07
Copy link
Collaborator

@M-casado M-casado left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Just a few comments out of curiosity.

If I understood correctly, 87 new evidence strings (of overlapping genes) would be reported as of today with this modification, correct?

VEP_SHORT_QUERY_DISTANCE = 5000
VEP_LONG_QUERY_DISTANCE = 500000
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What was the reason behind no longer querying VEP twice if the short distance doesn't find any gene? Just curious

@@ -97,54 +88,58 @@ def load_consequence_severity_rank():
return {term: index for index, term in enumerate(get_severity_ranking())}


def extract_consequences(vep_results, acceptable_biotypes, only_closest, results_by_variant, report_distance=False):
def extract_consequences(vep_results, acceptable_biotypes):
"""Given VEP results, return a list of consequences matching certain criteria.

Args:
vep_results: results obtained from VEP for a list of variants, in JSON format.
acceptable_biotypes: a list of transcript biotypes to consider (as defined in Ensembl documentation, see
https://www.ensembl.org/info/genome/genebuild/biotypes.html). Consequences for other transcript biotypes
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why, among all biotypes, we only consider miRNAs and protein coding genes? Sorry again, has not much to do with this PR but I'm curious.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a good question and it seems to have been decided long ago (before the birth of this repo).
There are a few others that we could consider in the non-coding genes, the IG genes and the TR genes.
Probably. a topic to raise with OpenTarget

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created #328 for this, we can discuss at our next meeting.

@apriltuesday
Copy link
Contributor Author

Actually that's just 87 new evidence strings for the set of 2000 variants used in the test for the original VEP pipeline (based on coordinates, mostly SNPs / short indels). You can see this test for a much smaller set used in testing the current structural pipeline.

For your other questions, they're from before my time as well (and actually before the time of this repo, so not even in the history!) so we'd have to ask @tskir or @tcezard... I'm also curious about this.

@@ -97,54 +88,58 @@ def load_consequence_severity_rank():
return {term: index for index, term in enumerate(get_severity_ranking())}


def extract_consequences(vep_results, acceptable_biotypes, only_closest, results_by_variant, report_distance=False):
def extract_consequences(vep_results, acceptable_biotypes):
"""Given VEP results, return a list of consequences matching certain criteria.

Args:
vep_results: results obtained from VEP for a list of variants, in JSON format.
acceptable_biotypes: a list of transcript biotypes to consider (as defined in Ensembl documentation, see
https://www.ensembl.org/info/genome/genebuild/biotypes.html). Consequences for other transcript biotypes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a good question and it seems to have been decided long ago (before the birth of this repo).
There are a few others that we could consider in the non-coding genes, the IG genes and the TR genes.
Probably. a topic to raise with OpenTarget

@apriltuesday apriltuesday merged commit a11b9cf into EBIvariation:master May 13, 2022
@apriltuesday apriltuesday deleted the issue-321 branch May 13, 2022 15:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Revisit behaviour of most severe consequence filtering
3 participants