Size similarity linkage and bug fixes for SV matching tools #8257
Conversation
Github actions tests reported job failures from actions build 4845406997
Force-pushed from 54a8658 to a071f88
Github actions tests reported job failures from actions build 4853456806
Github actions tests reported job failures from actions build 4853463316
This looks good @mwalker174. I think most of my comments are more questions than hard suggestions.
positionB = variant.getEnd();
// Force reset of END coordinate
if (type.equals(StructuralVariantType.INS)) {
    positionB = positionA + 1;
Do we have some insertion records with END - POS > 1 to indicate microdeletions at the source? Do we definitely always want to reset the END of those here?
Yes, that's true, but so far the GATK tools we are using in the pipeline would not affect this where it matters (GenerateBatchMetrics through GenotypeBatch).
In the future, however, I'd like to use POS/END to mark the nominal location of the insertion but separate INFO tags to indicate the left/right SR coordinates. This will avoid difficulties with making sure POS precedes END, which, per the VCF spec, really shouldn't be used for marking the SR positions IMO.
Totally agreed, we should have separate info fields for SR positions.
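For illustration only, a sketch of how such separate INFO tags could be declared with htsjdk; the tag names LEFT_SR_POS and RIGHT_SR_POS are hypothetical and not part of this PR:

import htsjdk.variant.vcf.VCFHeader;
import htsjdk.variant.vcf.VCFHeaderLineType;
import htsjdk.variant.vcf.VCFInfoHeaderLine;

public final class InsertionSrTagSketch {
    // Hypothetical INFO keys for the left/right split-read breakpoints.
    public static final String LEFT_SR_POS_KEY = "LEFT_SR_POS";
    public static final String RIGHT_SR_POS_KEY = "RIGHT_SR_POS";

    // POS/END would then mark only the nominal insertion site, while these
    // tags carry the SR coordinates without any POS <= END constraint.
    public static void addSrInfoLines(final VCFHeader header) {
        header.addMetaDataLine(new VCFInfoHeaderLine(LEFT_SR_POS_KEY, 1,
                VCFHeaderLineType.Integer, "Left split-read breakpoint position"));
        header.addMetaDataLine(new VCFInfoHeaderLine(RIGHT_SR_POS_KEY, 1,
                VCFHeaderLineType.Integer, "Right split-read breakpoint position"));
    }
}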
if (items.size() == 1) {
    return items.get(0).getLog10PError();
} else {
    return null;
Is there any case to be made for taking the minimum here? I'm not sure what this is used for downstream yet, but it seems like if we are collapsing variants that we assume to be the same, our confidence that the variant is real should be the minimum error.
I was conservative here by marking them as null. You could take the min, but I'd argue this could be misleading and you'd really want to go back and recalculate variant qualities after collapsing.
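For comparison, a sketch of the min-based alternative discussed above, assuming the items are GATK SVCallRecord objects whose getLog10PError() returns a nullable Double (as the snippet suggests); this is not what the PR does:

import java.util.Collection;
import java.util.Objects;
import org.broadinstitute.hellbender.tools.sv.SVCallRecord;

// Sketch only: take the most confident (smallest) log10 error among the collapsed
// records, treating null as "no estimate". The PR instead returns null whenever
// more than one record is collapsed.
private static Double collapseLog10PError(final Collection<SVCallRecord> items) {
    return items.stream()
            .map(SVCallRecord::getLog10PError)   // assumed nullable Double accessor
            .filter(Objects::nonNull)
            .min(Double::compare)
            .orElse(null);
}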
public static final int DEFAULT_WINDOW_MIXED = 1000;
public static final double DEFAULT_SAMPLE_OVERLAP_MIXED = 0;

public static final double DEFAULT_RECIPROCAL_OVERLAP_PESR = 0.5;
public static final double DEFAULT_SIZE_SIMILARITY_PESR = 0;
What's the reason for keeping these size similarity defaults at zero? Is using size similarity something that only needs to happen in special cases? Otherwise I might try to set some good defaults for everyday use.
These are what we use for ClusterBatch still. We'll have to go back and re-evaluate the effects of using size similarity, which right now I'm using mostly for SVConcordance in the new filtering workflows.
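For concreteness, a cutoff of zero makes the size-similarity test from the linkage code below a no-op, so these defaults preserve the pre-existing ClusterBatch behavior:

// Illustrative only, using arbitrary example lengths.
final int lengthA = 300;
final int lengthB = 5000;
final double cutoff = 0;   // e.g. DEFAULT_SIZE_SIMILARITY_PESR
// min/max of two positive lengths is always >= 0, so the check always passes.
final boolean hasSizeSimilarity =
        Math.min(lengthA, lengthB) / (double) Math.max(lengthA, lengthB) >= cutoff;   // true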
if (params.requiresOverlapAndProximity() && !hasReciprocalOverlap) {
    final int sizeSimilarityLengthA = getLength(a, INSERTION_ASSUMED_LENGTH_FOR_SIZE_SIMILARITY);
    final int sizeSimilarityLengthB = getLength(b, INSERTION_ASSUMED_LENGTH_FOR_SIZE_SIMILARITY);
    final boolean hasSizeSimilarity = Math.min(sizeSimilarityLengthA, sizeSimilarityLengthB) / (double) Math.max(sizeSimilarityLengthA, sizeSimilarityLengthB) >= params.getSizeSimilarity();
I might extract this into a little method, if only to make it easier for someone scanning the code to quickly find the definition of size similarity based on the class method list.
Done. Did the same for reciprocal overlap and breakend window as well.
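For reference, the extracted helper could look roughly like the following, reusing the names from the snippet above; the exact signature and record type (assumed here to be SVCallRecord) may differ in the PR:

// Sketch of the extracted size-similarity check, based on the inline expression above.
private boolean hasSizeSimilarity(final SVCallRecord a, final SVCallRecord b,
                                  final ClusteringParameters params) {
    final int lengthA = getLength(a, INSERTION_ASSUMED_LENGTH_FOR_SIZE_SIMILARITY);
    final int lengthB = getLength(b, INSERTION_ASSUMED_LENGTH_FOR_SIZE_SIMILARITY);
    return Math.min(lengthA, lengthB) / (double) Math.max(lengthA, lengthB)
            >= params.getSizeSimilarity();
}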
    return false;
final boolean hasBreakendProximity = intervalA1.overlaps(intervalB1) && intervalA2.overlaps(intervalB2);
// Use short-circuiting statements since sample overlap is the least scalable / slowest check
if (hasReciprocalOverlapAndSizeSimilarity == null) {
Why not make this a lower-case boolean and just use false? This clause won't trigger if it's Boolean(false), which seems counterintuitive.
Here I used null as a distinct case where the variants are interchromosomal, in which case the logic is different from false as with intrachromosomal variants. I found the code to be clearest this way (it's a bit complicated since there are 3 booleans), although maybe there is a better way to write it?
I see. I guess I'd advocate for just re-checking the intrachromosomal status explicitly here instead of using null, but it's not a big deal either way.
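A sketch of that alternative, with plain booleans and an explicit interchromosomal branch; the inputs are assumed to be computed as in the snippet above, and the overlap-vs-proximity toggle (params.requiresOverlapAndProximity()) is omitted for brevity:

// Sketch only: make the interchromosomal case explicit instead of encoding it as a null Boolean.
private static boolean clusterTogether(final boolean intrachromosomal,
                                       final boolean hasReciprocalOverlapAndSizeSimilarity,
                                       final boolean hasBreakendProximity,
                                       final boolean hasSampleOverlap) {
    if (!intrachromosomal) {
        // Interchromosomal variants: reciprocal overlap and size similarity are undefined,
        // so only breakend proximity and sample overlap are evaluated.
        return hasBreakendProximity && hasSampleOverlap;
    }
    return hasReciprocalOverlapAndSizeSimilarity && hasBreakendProximity && hasSampleOverlap;
}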
@@ -209,7 +209,7 @@ public void onTraversalStart() {

         setIntervals(parser);

-        final ClusteringParameters clusterArgs = ClusteringParameters.createDepthParameters(clusterIntervalOverlap, clusterWindow, CLUSTER_SAMPLE_OVERLAP_FRACTION);
+        final ClusteringParameters clusterArgs = ClusteringParameters.createDepthParameters(clusterIntervalOverlap, 0, clusterWindow, CLUSTER_SAMPLE_OVERLAP_FRACTION);
Either a comment or using a named constant might help readers understand that size similarity isn't used here.
Added this as another tool argument.
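Roughly what wiring that argument through could look like; the flag name and default below are illustrative, not necessarily what the PR uses:

import org.broadinstitute.barclay.argparser.Argument;

// Hypothetical tool argument feeding the size-similarity cutoff into the depth parameters.
@Argument(fullName = "depth-size-similarity",   // illustrative flag name
        doc = "Minimum size similarity for depth-only clustering",
        optional = true)
private double clusterSizeSimilarity = 0;       // 0 reproduces the previous behavior

// In onTraversalStart():
// final ClusteringParameters clusterArgs = ClusteringParameters.createDepthParameters(
//         clusterIntervalOverlap, clusterSizeSimilarity, clusterWindow, CLUSTER_SAMPLE_OVERLAP_FRACTION);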
@@ -121,6 +122,20 @@ public void onTraversalStart() {
             throw new UserException("Reference sequence dictionary required");
         }

+        // Check that vcfs are sorted the same
+        final SequenceDictionaryUtils.SequenceDictionaryCompatibility compatibility = SequenceDictionaryUtils
+                .compareDictionaries(getTruthHeader().getSequenceDictionary(),
Is it not possible to use validateDictionaries instead? Seems a shame to have to duplicate the exception message below.
Thanks for catching that. I had originally tried validating dictionaries within the superclass, and validateDictionaries caused some other tests to fail (by also requiring that contig indices match), but I eventually moved the check here and it's not a problem.
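For context, a sketch of how the quoted (truncated) check is typically completed; the eval-header accessor, the third argument, and the accepted compatibility value are assumptions about GATK's SequenceDictionaryUtils rather than the PR's exact code:

// Sketch only: compare the truth and eval dictionaries without requiring matching
// contig indices, and fail with a single, clear message otherwise.
final SequenceDictionaryUtils.SequenceDictionaryCompatibility compatibility =
        SequenceDictionaryUtils.compareDictionaries(
                getTruthHeader().getSequenceDictionary(),
                getEvalHeader().getSequenceDictionary(),   // hypothetical accessor
                false);                                    // don't check contig ordering/indices
if (compatibility != SequenceDictionaryUtils.SequenceDictionaryCompatibility.IDENTICAL) {
    throw new UserException("Truth and eval VCF sequence dictionaries must be compatible");
}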
The new changes look good and make things a little easier to read. Thanks for answering my questions as well.
Fixes a bug with the SortedMultiset in SVClusterEngine that sometimes caused records with identical start positions to get lost (see the sketch below).
INFO/FORMAT fields, which cause a parsing error with Integer types.
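To illustrate the SortedMultiset issue fixed above: when a comparator-based multiset is keyed only on start position, distinct records that compare equal collapse onto one representative object. A small standalone sketch with Guava (not the PR's actual code):

import com.google.common.collect.TreeMultiset;
import java.util.Comparator;

public class MultisetLostRecordDemo {
    record Item(int start, String id) {}

    public static void main(String[] args) {
        final TreeMultiset<Item> buffer = TreeMultiset.create(Comparator.comparingInt(Item::start));
        buffer.add(new Item(100, "recordA"));
        buffer.add(new Item(100, "recordB"));   // compares equal to recordA by start position

        System.out.println(buffer.size());                        // 2
        // TreeMultiset stores one representative per comparator-equal group,
        // so iteration prints "recordA" twice and recordB is no longer reachable.
        buffer.forEach(item -> System.out.println(item.id()));
    }
}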