Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Size similarity linkage and bug fixes for SV matching tools #8257

Merged
merged 6 commits into from
May 15, 2023

Conversation

mwalker174
Copy link
Contributor

@mwalker174 mwalker174 commented Mar 22, 2023

  • Adds size similarity criterion to SVConcordance and SVCluster tools. This is particularly useful for accurately matching smaller SVs that have a high degree of breakpoint uncertainty, in which case reciprocal overlap does not work well. PESR/mixed variant types must have size similarity, reciprocal overlap, and breakend window criteria met. Depth-only variants may have either size similarity + reciprocal overlap OR breakend window criteria met (or both).
  • Rewrites some of the linkage logic to be simpler to read.
  • Fixes a rare bug with SortedMultiset in SVClusterEngine that sometimes caused records with identical start positions to get lost.
  • Removes null record attributes to avoid . INFO/FORMAT fields, which cause a parsing error with Integer types.
  • Add check that the vcf header contigs are sorted in the same order.
  • Retain FILTER and QUAL fields in output.

@gatk-bot
Copy link

gatk-bot commented Apr 30, 2023

Github actions tests reported job failures from actions build 4845406997
Failures in the following jobs:

Test Type JDK Job ID Logs
integration 17.0.6+10 4845406997.11 logs
integration 17.0.6+10 4845406997.0 logs

@mwalker174 mwalker174 force-pushed the mw_svcluster_size_sim branch from 54a8658 to a071f88 Compare May 1, 2023 17:53
@gatk-bot
Copy link

gatk-bot commented May 1, 2023

Github actions tests reported job failures from actions build 4853456806
Failures in the following jobs:

Test Type JDK Job ID Logs
integration 17.0.6+10 4853456806.11 logs
integration 17.0.6+10 4853456806.0 logs

@gatk-bot
Copy link

gatk-bot commented May 1, 2023

Github actions tests reported job failures from actions build 4853463316
Failures in the following jobs:

Test Type JDK Job ID Logs
integration 17.0.6+10 4853463316.11 logs
integration 17.0.6+10 4853463316.0 logs

@mwalker174 mwalker174 changed the title Add size similarity to CanonicalSVLinkage Size similarity linkage and bug fixes for SV matching tools May 3, 2023
Copy link
Member

@cwhelan cwhelan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good @mwalker174 . I think most of my comments are more questions than hard suggestions.

positionB = variant.getEnd();
// Force reset of END coordinate
if (type.equals(StructuralVariantType.INS)) {
positionB = positionA + 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have some insertion records with END - POS > 1 to indicate microdeletions at the source? Do we definitely always want to reset the END of those here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that's true, but so far the GATK tools we are using in the pipeline would not affect this where it matters (GenerateBatchMetrics through GenotypeBatch).

In the future, however, I'd like to use POS/END to mark the nominal location of the insertion but separate INFO tags to indicate the left/right SR coordinates. This will avoid difficulties with making sure POS precedes END, which really should not be used for marking the SR positions IMO according to the VCF spec.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally agreed, we should have separate info fields for SR positions.

if (items.size() == 1) {
return items.get(0).getLog10PError();
} else {
return null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any case to be made to take the minimum here? I'm not sure what this is used for downstream yet but it seems like if we are collapsing variants that we assume to be the same our confidence that the variant is real should be the min error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was conservative here by marking them as null. You could take the min, but I'd argue this could be misleading and you'd really want to go back and recalculate variant qualities after collapsing.

public static final int DEFAULT_WINDOW_MIXED = 1000;
public static final double DEFAULT_SAMPLE_OVERLAP_MIXED = 0;

public static final double DEFAULT_RECIPROCAL_OVERLAP_PESR = 0.5;
public static final double DEFAULT_SIZE_SIMILARITY_PESR = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reason for keeping these defaults all at zero? Is using size similarity something that only needs to happen in special cases? Otherwise I might try to set some good defaults for everyday use.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are what we use for ClusterBatch still. We'll have to go back and re-evaluate the effects of using size similarity, which right now I used more for SVConcordance in the new filtering workflows.

if (params.requiresOverlapAndProximity() && !hasReciprocalOverlap) {
final int sizeSimilarityLengthA = getLength(a, INSERTION_ASSUMED_LENGTH_FOR_SIZE_SIMILARITY);
final int sizeSimilarityLengthB = getLength(b, INSERTION_ASSUMED_LENGTH_FOR_SIZE_SIMILARITY);
final boolean hasSizeSimilarity = Math.min(sizeSimilarityLengthA, sizeSimilarityLengthB) / (double) Math.max(sizeSimilarityLengthA, sizeSimilarityLengthB) >= params.getSizeSimilarity();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might extract this into a little method, if only to make it easier for someone scanning the code to quickly find the definition of size similarity based on the class method list.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Did the same for reciprocal overlap and breakend window as well.

return false;
final boolean hasBreakendProximity = intervalA1.overlaps(intervalB1) && intervalA2.overlaps(intervalB2);
// Use short-circuiting statements since sample overlap is the least scalable / slowest check
if (hasReciprocalOverlapAndSizeSimilarity == null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not make this a lower case boolean and just use false? This clause won't trigger if it's Boolean(false) which seems counterintuitive.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I used null as a distinct case where the variants are interchromosomal, in which case the logic is different from False as with intrachromosomal variants. I found the code to be clearest this way (it's a bit complicated since there are 3 booleans) although maybe there is a better way to write it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I guess I'd advocating for just re-checking the intrachromosal status explicitly here again here instead of using null, but it's not a big deal either way.

@@ -209,7 +209,7 @@ public void onTraversalStart() {

setIntervals(parser);

final ClusteringParameters clusterArgs = ClusteringParameters.createDepthParameters(clusterIntervalOverlap, clusterWindow, CLUSTER_SAMPLE_OVERLAP_FRACTION);
final ClusteringParameters clusterArgs = ClusteringParameters.createDepthParameters(clusterIntervalOverlap, 0, clusterWindow, CLUSTER_SAMPLE_OVERLAP_FRACTION);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either a comment or a using a named constant might help readers understand that size similarity isn't used here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this as another tool argument

@@ -121,6 +122,20 @@ public void onTraversalStart() {
throw new UserException("Reference sequence dictionary required");
}

// Check that vcfs are sorted the same
final SequenceDictionaryUtils.SequenceDictionaryCompatibility compatibility = SequenceDictionaryUtils
.compareDictionaries(getTruthHeader().getSequenceDictionary(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it not possible to use validateDictionaries instead? Seems a shame to have to duplicate the exception message below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching that. I had originally tried validating dictionaries within the superclass and validateDictionaries caused some other tests to fail (by also requiring contig indices match), but I eventually moved the check here and it's not a problem.

Copy link
Member

@cwhelan cwhelan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new changes look good and make things a little easier to read. Thanks for answering my questions as well.

@mwalker174 mwalker174 merged commit 18edcd3 into master May 15, 2023
@mwalker174 mwalker174 deleted the mw_svcluster_size_sim branch May 15, 2023 16:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants