Size similarity linkage and bug fixes for SV matching tools #8257

mwalker174 · 2023-03-22T18:25:27Z

Adds size similarity criterion to SVConcordance and SVCluster tools. This is particularly useful for accurately matching smaller SVs that have a high degree of breakpoint uncertainty, in which case reciprocal overlap does not work well. PESR/mixed variant types must have size similarity, reciprocal overlap, and breakend window criteria met. Depth-only variants may have either size similarity + reciprocal overlap OR breakend window criteria met (or both).
Rewrites some of the linkage logic to be simpler to read.
Fixes a rare bug with SortedMultiset in SVClusterEngine that sometimes caused records with identical start positions to get lost.
Removes null record attributes to avoid . INFO/FORMAT fields, which cause a parsing error with Integer types.
Add check that the vcf header contigs are sorted in the same order.
Retain FILTER and QUAL fields in output.

gatk-bot · 2023-04-30T17:43:57Z

Github actions tests reported job failures from actions build 4845406997
Failures in the following jobs:

Test Type	JDK	Job ID	Logs
integration	17.0.6+10	4845406997.11	logs
integration	17.0.6+10	4845406997.0	logs

gatk-bot · 2023-05-01T18:34:21Z

Github actions tests reported job failures from actions build 4853456806
Failures in the following jobs:

Test Type	JDK	Job ID	Logs
integration	17.0.6+10	4853456806.11	logs
integration	17.0.6+10	4853456806.0	logs

gatk-bot · 2023-05-01T18:34:47Z

Github actions tests reported job failures from actions build 4853463316
Failures in the following jobs:

Test Type	JDK	Job ID	Logs
integration	17.0.6+10	4853463316.11	logs
integration	17.0.6+10	4853463316.0	logs

cwhelan

This looks good @mwalker174 . I think most of my comments are more questions than hard suggestions.

cwhelan · 2023-05-05T13:18:57Z

src/main/java/org/broadinstitute/hellbender/tools/sv/SVCallRecordUtils.java

-            positionB = variant.getEnd();
+            // Force reset of END coordinate
+            if (type.equals(StructuralVariantType.INS)) {
+                positionB = positionA + 1;


Do we have some insertion records with END - POS > 1 to indicate microdeletions at the source? Do we definitely always want to reset the END of those here?

Yes that's true, but so far the GATK tools we are using in the pipeline would not affect this where it matters (GenerateBatchMetrics through GenotypeBatch).

In the future, however, I'd like to use POS/END to mark the nominal location of the insertion but separate INFO tags to indicate the left/right SR coordinates. This will avoid difficulties with making sure POS precedes END, which really should not be used for marking the SR positions IMO according to the VCF spec.

Totally agreed, we should have separate info fields for SR positions.

cwhelan · 2023-05-05T13:23:14Z

src/main/java/org/broadinstitute/hellbender/tools/sv/cluster/CanonicalSVCollapser.java

+        if (items.size() == 1) {
+            return items.get(0).getLog10PError();
+        } else {
+            return null;


Is there any case to be made to take the minimum here? I'm not sure what this is used for downstream yet but it seems like if we are collapsing variants that we assume to be the same our confidence that the variant is real should be the min error.

I was conservative here by marking them as null. You could take the min, but I'd argue this could be misleading and you'd really want to go back and recalculate variant qualities after collapsing.

cwhelan · 2023-05-05T14:03:27Z

src/main/java/org/broadinstitute/hellbender/tools/sv/cluster/CanonicalSVLinkage.java

    public static final int DEFAULT_WINDOW_MIXED = 1000;
    public static final double DEFAULT_SAMPLE_OVERLAP_MIXED = 0;

    public static final double DEFAULT_RECIPROCAL_OVERLAP_PESR = 0.5;
+    public static final double DEFAULT_SIZE_SIMILARITY_PESR = 0;


What's the reason for keeping these defaults all at zero? Is using size similarity something that only needs to happen in special cases? Otherwise I might try to set some good defaults for everyday use.

These are what we use for ClusterBatch still. We'll have to go back and re-evaluate the effects of using size similarity, which right now I used more for SVConcordance in the new filtering workflows.

cwhelan · 2023-05-05T14:04:49Z

src/main/java/org/broadinstitute/hellbender/tools/sv/cluster/CanonicalSVLinkage.java

-            if (params.requiresOverlapAndProximity() && !hasReciprocalOverlap) {
+            final int sizeSimilarityLengthA = getLength(a, INSERTION_ASSUMED_LENGTH_FOR_SIZE_SIMILARITY);
+            final int sizeSimilarityLengthB = getLength(b, INSERTION_ASSUMED_LENGTH_FOR_SIZE_SIMILARITY);
+            final boolean hasSizeSimilarity = Math.min(sizeSimilarityLengthA, sizeSimilarityLengthB) / (double) Math.max(sizeSimilarityLengthA, sizeSimilarityLengthB) >= params.getSizeSimilarity();


I might extract this into a little method, if only to make it easier for someone scanning the code to quickly find the definition of size similarity based on the class method list.

Done. Did the same for reciprocal overlap and breakend window as well.

cwhelan · 2023-05-05T14:07:45Z

src/main/java/org/broadinstitute/hellbender/tools/sv/cluster/CanonicalSVLinkage.java

-            return false;
+        final boolean hasBreakendProximity = intervalA1.overlaps(intervalB1) && intervalA2.overlaps(intervalB2);
+        // Use short-circuiting statements since sample overlap is the least scalable / slowest check
+        if (hasReciprocalOverlapAndSizeSimilarity == null) {


Why not make this a lower case boolean and just use false? This clause won't trigger if it's Boolean(false) which seems counterintuitive.

Here I used null as a distinct case where the variants are interchromosomal, in which case the logic is different from False as with intrachromosomal variants. I found the code to be clearest this way (it's a bit complicated since there are 3 booleans) although maybe there is a better way to write it?

I see. I guess I'd advocating for just re-checking the intrachromosal status explicitly here again here instead of using null, but it's not a big deal either way.

cwhelan · 2023-05-05T14:16:35Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/sv/JointGermlineCNVSegmentation.java

@@ -209,7 +209,7 @@ public void onTraversalStart() {

        setIntervals(parser);

-        final ClusteringParameters clusterArgs = ClusteringParameters.createDepthParameters(clusterIntervalOverlap, clusterWindow, CLUSTER_SAMPLE_OVERLAP_FRACTION);
+        final ClusteringParameters clusterArgs = ClusteringParameters.createDepthParameters(clusterIntervalOverlap, 0, clusterWindow, CLUSTER_SAMPLE_OVERLAP_FRACTION);


Either a comment or a using a named constant might help readers understand that size similarity isn't used here.

Added this as another tool argument

cwhelan · 2023-05-05T14:23:35Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/sv/SVConcordance.java

@@ -121,6 +122,20 @@ public void onTraversalStart() {
            throw new UserException("Reference sequence dictionary required");
        }

+        // Check that vcfs are sorted the same
+        final SequenceDictionaryUtils.SequenceDictionaryCompatibility compatibility =  SequenceDictionaryUtils
+                .compareDictionaries(getTruthHeader().getSequenceDictionary(),


Is it not possible to use validateDictionaries instead? Seems a shame to have to duplicate the exception message below.

Thanks for catching that. I had originally tried validating dictionaries within the superclass and validateDictionaries caused some other tests to fail (by also requiring contig indices match), but I eventually moved the check here and it's not a problem.

cwhelan

The new changes look good and make things a little easier to read. Thanks for answering my questions as well.

mwalker174 added 4 commits May 1, 2023 13:53

Add size similarity to SVCluster

4e7d682

Sanitize null attributes

c3b05b5

Check that vcf contigs sorted the same, with integration testing

d3e8c7a

Add FILTER and QUAL fields to SVCallRecord

a071f88

mwalker174 force-pushed the mw_svcluster_size_sim branch from 54a8658 to a071f88 Compare May 1, 2023 17:53

Move dictionary ordering check to SVConcordance

b059070

mwalker174 changed the title ~~Add size similarity to CanonicalSVLinkage~~ Size similarity linkage and bug fixes for SV matching tools May 3, 2023

cwhelan requested changes May 5, 2023

View reviewed changes

Address reviewer comments

fc4205d

cwhelan approved these changes May 12, 2023

View reviewed changes

mwalker174 merged commit 18edcd3 into master May 15, 2023

mwalker174 deleted the mw_svcluster_size_sim branch May 15, 2023 16:38

mwalker174 mentioned this pull request May 25, 2023

SVConcordance workflows update broadinstitute/gatk-sv#540

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Size similarity linkage and bug fixes for SV matching tools #8257

Size similarity linkage and bug fixes for SV matching tools #8257

mwalker174 commented Mar 22, 2023 •

edited

Loading

gatk-bot commented Apr 30, 2023 •

edited

Loading

gatk-bot commented May 1, 2023 •

edited

Loading

gatk-bot commented May 1, 2023 •

edited

Loading

cwhelan left a comment

cwhelan May 5, 2023

mwalker174 May 9, 2023

cwhelan May 12, 2023

cwhelan May 5, 2023

mwalker174 May 9, 2023

cwhelan May 5, 2023

mwalker174 May 9, 2023

cwhelan May 5, 2023

mwalker174 May 9, 2023

cwhelan May 5, 2023

mwalker174 May 9, 2023

cwhelan May 12, 2023

cwhelan May 5, 2023

mwalker174 May 9, 2023

cwhelan May 5, 2023

mwalker174 May 9, 2023

cwhelan left a comment

Size similarity linkage and bug fixes for SV matching tools #8257

Size similarity linkage and bug fixes for SV matching tools #8257

Conversation

mwalker174 commented Mar 22, 2023 • edited Loading

gatk-bot commented Apr 30, 2023 • edited Loading

gatk-bot commented May 1, 2023 • edited Loading

gatk-bot commented May 1, 2023 • edited Loading

cwhelan left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cwhelan left a comment

Choose a reason for hiding this comment

mwalker174 commented Mar 22, 2023 •

edited

Loading

gatk-bot commented Apr 30, 2023 •

edited

Loading

gatk-bot commented May 1, 2023 •

edited

Loading

gatk-bot commented May 1, 2023 •

edited

Loading