
Extract Cohort optimizations [VS-493] [VS-1516] #9055

Merged: 16 commits merged into ah_var_store on Dec 10, 2024

Conversation

@mcovarr (Collaborator) commented Nov 26, 2024

Integration test run here.

Follows the breadcrumbs in VS-493 to drop filter info and filter sites that fall outside the variant locations of the samples being extracted.

Test runs here:

Findings:

  • This data set is the biggest non-AoU dataset we have but is decidedly on the small side for being able to measure the effects of changes like this.
  • For the particular runs linked above, the code on this branch dropped ~85% of unnecessary filter info.
  • Presumably because there is less work being done, these extracts run more quickly than the baseline code.
  • It's questionable whether these changes would help with "small subsets": in the "small subset" use cases we extract all samples, but only the variant data over a specified interval list. I don't think the current logic is informed by interval lists; we should look into this further.
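The pruning described above can be illustrated with a minimal, hypothetical Java sketch. `FilterSitePruner`, `FilterSite`, and `prune` are illustrative names only, not the GVS schema or the actual `ExtractCohortEngine` code; the idea is simply that filter-site records whose locations never appear among the extracted samples' variant locations can be dropped before further processing.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch: keep only the filter-site records at locations where the
// extracted samples actually have variant data; drop everything else.
public class FilterSitePruner {

    // Illustrative stand-in for a filter-site row (not the real GVS table schema).
    static final class FilterSite {
        final long location;
        final String filter;

        FilterSite(long location, String filter) {
            this.location = location;
            this.filter = filter;
        }
    }

    // Retain only filter sites whose location appears in the samples' variant locations.
    static List<FilterSite> prune(List<FilterSite> filterSites, Set<Long> sampleVariantLocations) {
        return filterSites.stream()
                .filter(site -> sampleVariantLocations.contains(site.location))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<FilterSite> sites = List.of(
                new FilterSite(100L, "PASS"),
                new FilterSite(200L, "low_VQSLOD"),
                new FilterSite(300L, "PASS"));
        // Only locations 100 and 300 carry variant data for the extracted samples.
        Set<Long> variantLocations = Set.of(100L, 300L);
        System.out.println(prune(sites, variantLocations).size()); // prints 2
    }
}
```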

@@ -86,6 +85,15 @@ public class ExtractCohortEngine {

private final Consumer<VariantContext> variantContextConsumer;

private static class VariantIterables {
public Iterable<GenericRecord> vets;
Collaborator:
Maybe we should brainstorm a better name for the vets table and related code? It now collides with the new name for VQSR Lite.

Collaborator Author:
ugh yes, good point

@gbggrant (Collaborator) left a comment:

Looks good!

@koncheto-broad left a comment:

LGTM. Approved contingent on a correctness comparison in new runs.

@koncheto-broad

The correctness comparisons I mentioned are between your subcohort BGE extracts from the WGS 3k callset, pulling from ah_var_store versus this branch. In theory you can also compare memory usage between the two and document it, to see whether the code affects sub-region extracts as well as subcohort extracts (although those results will not gate merging this PR).

@mcovarr (Collaborator, Author) commented Nov 27, 2024

I did run the BGE correctness comparisons mentioned above, and everything tied out perfectly with respect to ah_var_store, dropping > 99% of filter set info and > 98% of filter set sites. The runtimes of these extracts are even shorter than they were for the WGS dataset, so the graphs are not going to be terribly informative. I'm planning to reach out to see if we can run this code against a larger AoU dataset after the break.

@koncheto-broad

Since we all agree that these changes are functionally sound and beneficial, I'm going to merge this PR and pull the full evaluation of the results into a separate ticket and PR.

@koncheto-broad koncheto-broad merged commit c068217 into ah_var_store Dec 10, 2024
16 of 18 checks passed
@koncheto-broad koncheto-broad deleted the vs_1516_yolo branch December 10, 2024 20:12
3 participants