
Extract Cohort optimizations [VS-493] [VS-1516] #9055

Merged: 16 commits merged into ah_var_store on Dec 10, 2024

Conversation

@mcovarr (Collaborator) commented Nov 26, 2024

Integration test run here.

Follows the breadcrumbs in VS-493 to drop filter info and filter sites that fall outside the variant locations of the samples being extracted.

Test runs here:

Findings:

  • This data set is the biggest non-AoU dataset we have but is decidedly on the small side for being able to measure the effects of changes like this.
  • For the particular runs linked above, the code on this branch dropped ~85% of unnecessary filter info.
  • Presumably because there is less work being done, these extracts run more quickly than the baseline code.
  • It's questionable whether these changes would help with "small subsets": in the "small subset" use cases we extract all samples, but only the variant data over a specified interval list. I don't think the current logic is informed by interval lists; we should look into this further.
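The pruning described above can be illustrated with a minimal, hypothetical Java sketch. `FilterSitePruner`, `FilterSite`, and `prune` are illustrative names only, not the GVS schema or the actual `ExtractCohortEngine` code; the idea is simply that filter-site records whose locations never appear among the extracted samples' variant locations can be dropped before further processing.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch: keep only the filter-site records at locations where the
// extracted samples actually have variant data; drop everything else.
public class FilterSitePruner {

    // Illustrative stand-in for a filter-site row (not the real GVS table schema).
    static final class FilterSite {
        final long location;
        final String filter;

        FilterSite(long location, String filter) {
            this.location = location;
            this.filter = filter;
        }
    }

    // Retain only filter sites whose location appears in the samples' variant locations.
    static List<FilterSite> prune(List<FilterSite> filterSites, Set<Long> sampleVariantLocations) {
        return filterSites.stream()
                .filter(site -> sampleVariantLocations.contains(site.location))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<FilterSite> sites = List.of(
                new FilterSite(100L, "PASS"),
                new FilterSite(200L, "low_VQSLOD"),
                new FilterSite(300L, "PASS"));
        // Only locations 100 and 300 carry variant data for the extracted samples.
        Set<Long> variantLocations = Set.of(100L, 300L);
        System.out.println(prune(sites, variantLocations).size()); // prints 2
    }
}
```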

@@ -86,6 +85,15 @@ public class ExtractCohortEngine {

private final Consumer<VariantContext> variantContextConsumer;

private static class VariantIterables {
public Iterable<GenericRecord> vets;
Collaborator:
Maybe we should brainstorm a better name for the vets table and related code? It now collides with the new name for VQSR Lite.

Collaborator Author:
ugh yes, good point

@gbggrant (Collaborator) left a comment:

Looks good!

@koncheto-broad left a comment:

LGTM. Approved contingent on a correctness comparison in new runs.

@koncheto-broad

The correctness comparisons I mentioned are between your subcohort BGE extracts from the WGS 3k callset, pulling from ah_var_store versus this branch. In theory you can also compare memory usage between the two and document it, to see whether the code affects sub-region extracts as well as subcohort extracts (although those results will not gate merging this PR).

@mcovarr (Collaborator, Author) commented Nov 27, 2024

I did run the BGE correctness comparisons mentioned above, and everything tied out perfectly with respect to ah_var_store, dropping > 99% of filter set info and > 98% of filter set sites. The runtimes of these extracts are even shorter than they were for the WGS dataset, so the graphs are not going to be terribly informative. I'm planning to reach out to see if we can run this code against a larger AoU dataset after the break.

@koncheto-broad

Since we all agree that these changes are functionally sound and beneficial, I'm going to merge this PR and pull the full evaluation of the results into a separate ticket and PR.

@koncheto-broad koncheto-broad merged commit c068217 into ah_var_store Dec 10, 2024
16 of 18 checks passed
@koncheto-broad koncheto-broad deleted the vs_1516_yolo branch December 10, 2024 20:12
3 participants