Extract Cohort optimizations [VS-493] [VS-1516] #9055
Conversation
@@ -86,6 +85,15 @@ public class ExtractCohortEngine {

    private final Consumer<VariantContext> variantContextConsumer;

    private static class VariantIterables {
        public Iterable<GenericRecord> vets;
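For context, a minimal sketch of what a holder class like VariantIterables might look like once fleshed out; the refRanges field, constructor, and empty() helper below are hypothetical illustrations, not taken from this diff:

    import org.apache.avro.generic.GenericRecord;
    import java.util.Collections;

    // Hypothetical sketch: bundles the Avro record iterables that one extract pass
    // walks through, so they can be passed around as a single unit.
    class VariantIterables {
        public Iterable<GenericRecord> vets;       // variant rows (table name under discussion below)
        public Iterable<GenericRecord> refRanges;  // hypothetical: reference range rows

        VariantIterables(Iterable<GenericRecord> vets, Iterable<GenericRecord> refRanges) {
            this.vets = vets;
            this.refRanges = refRanges;
        }

        // Hypothetical convenience for an empty result set.
        static VariantIterables empty() {
            return new VariantIterables(Collections.emptyList(), Collections.emptyList());
        }
    }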
Maybe we should brainstorm a better name for the vets table and related code? It now collides with the new name for VQSR Lite.
ugh yes, good point
Looks good!
LGTM
Approved, contingent on the correctness comparison in new runs.
The correctness comparisons I mentioned are between your subcohort BGE extracts from the WGS 3k callset, pulling from ah_var_store versus this branch. In theory you could also compare memory usage between them and document it, to see whether the code affects sub-region extracts as well as subcohort extracts (although those results will not gate merging this PR).
I did run the BGE correctness comparisons mentioned above and everything tied out perfectly with respect to ah_var_store, while dropping > 99% of filter set info rows and > 98% of filter set sites rows. The runtimes of these extracts are even shorter than they were for the WGS dataset, so the graphs are not going to be terribly informative. I'm thinking of reaching out to see if we can run this code against a larger AoU dataset after the break.
Since we all agree that these changes are functionally sound and beneficial, I'm going to merge this PR and pull the full evaluation of the results into another ticket and PR.
Integration test run here.
Follows the bread crumbs in VS-493 to drop filter set info and filter set sites rows outside of the variant locations for the samples being extracted.
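As a rough illustration of that idea (the row types and the pruneToVariantLocations helper below are hypothetical, not the actual ExtractCohortEngine code), filter set info and filter set sites rows are kept only at locations where the extracted samples actually have variants:

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;
    import java.util.stream.Collectors;

    // Hypothetical sketch of the VS-493 optimization: restrict filter set info /
    // filter set sites rows to the locations where the extracted samples have variants.
    class FilterSetPruningSketch {

        record FilterRow(long location, String payload) {}  // stand-in for a filter set info/sites row
        record VariantRow(long location) {}                  // stand-in for a vets/variant row

        static List<FilterRow> pruneToVariantLocations(List<FilterRow> filterRows,
                                                       List<VariantRow> variantRows) {
            // Locations at which the extracted samples have variants.
            Set<Long> variantLocations = variantRows.stream()
                    .map(VariantRow::location)
                    .collect(Collectors.toCollection(HashSet::new));

            // Keep only the filter rows at those locations; everything else is dropped,
            // which is where the large reductions reported above come from.
            return filterRows.stream()
                    .filter(row -> variantLocations.contains(row.location()))
                    .collect(Collectors.toList());
        }
    }

In practice such pruning could equally be pushed down into the extract queries; the sketch only shows the filtering criterion.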
Test runs here:
Findings: