SVCluster tool #7541
Conversation
@mwalker174 This is looking great, glad to see it so fleshed out.
```java
    newGenotypes.add(new GenotypeBuilder(g).alleles(genotypeAlleles).make());
}
builder.genotypes(newGenotypes);
if (record.getGenotypes().stream().anyMatch(g -> g.getAlleles().isEmpty())) {
```
It feels inefficient to stream over the genotypes twice (first to check if any are empty, then to sanitize), even if the anyMatch stream stops at the first match. Why not just always return the mapped stream below? Or do you think the overhead costs of mapping and re-collecting outweigh the benefits?
I think you're right, this was a premature optimization on my part. I was thinking it would be relatively rare to have empty alleles, and this would make it faster by not having to make a copy of the genotypes. On second thought, I'm not sure this would have a huge effect, and it could lead to some puzzling performance inconsistency across different VCFs (e.g. a chrY VCF would usually run slower since it's likely to have a lot of empty genotypes). I've changed it to simply call sanitize on every genotype.
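For illustration, a minimal sketch of the single-pass version (the NO_CALL fill-in is an assumed stand-in for the PR's actual sanitize logic):

```java
import htsjdk.variant.variantcontext.Allele;
import htsjdk.variant.variantcontext.Genotype;
import htsjdk.variant.variantcontext.GenotypeBuilder;

import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

public final class GenotypeSanitizerSketch {
    // Map every genotype unconditionally: non-empty genotypes pass through,
    // empty ones are rebuilt with a single NO_CALL allele. One pass, no
    // anyMatch pre-check, so runtime is uniform across VCFs.
    public static List<Genotype> sanitizeAll(final List<Genotype> genotypes) {
        return genotypes.stream()
                .map(g -> g.getAlleles().isEmpty()
                        ? new GenotypeBuilder(g).alleles(Collections.singletonList(Allele.NO_CALL)).make()
                        : g)
                .collect(Collectors.toList());
    }
}
```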
```java
for (final String sample : missingSamples) {
    final GenotypeBuilder genotypeBuilder = new GenotypeBuilder(sample);
    if (attributes != null) {
        genotypeBuilder.attributes(attributes);
    final int ploidy = ploidyTable.get(sample, contig);
```
I'd add some documentation in the method javadoc about how this logic works (and add @param docs for refAlleleDefault and ploidyTable).
Done
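For reference, a hypothetical sketch of the kind of javadoc being requested (the wording and the described default behavior are inferred from the parameter names, not taken from the PR's actual docs):

```java
/**
 * Fills in genotypes for samples that appear in the header but are missing
 * from the record, sizing each new genotype by the sample's ploidy on the
 * record's contig.
 *
 * @param refAlleleDefault if true, missing alleles default to the reference
 *                         allele; otherwise they are left as no-calls
 * @param ploidyTable      per-sample, per-contig ploidy lookup used to decide
 *                         how many alleles each filled-in genotype carries
 */
```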
```java
        .sorted()
        .collect(Collectors.toList());
// Alt alleles match already, or some alts vanished in genotype collapsing (and we keep them)
if (genotypeAltAlleles.equals(sortedAltAlleles) || sortedAltAlleles.containsAll(genotypeAltAlleles)) {
```
The first part of the || seems redundant with the second.
Probably another premature optimization - I was trying to avoid containsAll(), which is O(N·M) for two lists. These lists will likely be short. Again favoring performance consistency rather than optimizing the more common case, I've changed genotypeAltAlleles to a Set and will only check containsAll().
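A minimal sketch of the idea (variable names mirror the diff; which side becomes the Set may differ from the actual change): hashing the looked-up side makes each contains() O(1), so containsAll() runs in linear rather than quadratic time.

```java
import htsjdk.variant.variantcontext.Allele;

import java.util.HashSet;
import java.util.List;
import java.util.Set;

public final class AltAlleleCheckSketch {
    // With the site alts hashed, containsAll() costs one constant-time lookup
    // per genotype alt, making the equals() fast path unnecessary.
    public static boolean altsRetained(final List<Allele> sortedAltAlleles,
                                       final Set<Allele> genotypeAltAlleles) {
        return new HashSet<>(sortedAltAlleles).containsAll(genotypeAltAlleles);
    }
}
```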
```java
        && sampleAltAlleles.get(0).equals(Allele.SV_SIMPLE_DUP)) {
    return Collections.nCopies(expectedCopyNumber, Allele.NO_CALL);
// Ploidy 0
if (expectedCopyNumber == 0) {
```
There are some places where you get, e.g., coverage on Y in females because of segmental duplications between Y and other chromosomes. Do you think we always want to throw data from those places away? (Maybe we do, just wanted to raise it as a discussion point.)
This is a great point. On the one hand, I can see why it would be of interest and valid to preserve the calls there. On the other hand, it seems like a contradiction to have a duplication "on" a chromosome that doesn't exist. I think this highlights a weakness of the way we represent duplications.
For example, consider a seg dup present in 2 copies on the reference and a sample that matches reference (assume diploid for both). Then you technically have 4 copies of this particular sequence, but should that get called as a dup? Most would argue probably not, but in some sense it's not wrong. If you did decide to call it, would you call it at one or both? What if there were a duplication of one copy but you had insufficient information to determine which locus? Where do you put the call? Do you call it twice - once at each locus? Seems ambiguous.
I don't want to open that can of worms here. If we really wanted to, I think representing such events with an insertion might be practical. IMO, graph representations are the solution to these kinds of issues. I've added a TODO here.
Some of the issues you are talking about here in your second paragraph are behind some of the changes that are coming to the VCF spec in version 4.4. In particular, the SVCLAIM info field will allow you to specify whether your deletion and duplication calls are really claiming that there are breakpoints at the specified location, or if the claim is simply that the copy number of the segment on the reference differs from expectations based on read depth. My initial thought when reading this segment of code was that representing things in this style might allow us to preserve the calls we are throwing away here. What do you think of that approach?
Genome STRiP measures the copy number of repeated sequences like segmental duplications and represents it by adding a "duplicate segments" info field. So for example if there are two copies of a segmental duplication in the reference on the autosomes (so you expect a reference sample to be copy number 4), and the copy number is measured to be 5 but we don't know which side of the seg dup is duplicated, it writes a record at one locus with the duplicate segment info field pointing to the other segment and the sample's CN format field set to 5. In VCF 4.4 this sort of thing would still make sense using a depth SVCLAIM type representation.
It would be very realistic to adopt 4.4 in the output in the near future (or have a setting to select which spec to use), and it seems like it would be easy to set SVCLAIM at the end of the pipeline using the EVIDENCE field. Carrying it through certainly has its advantages, including resolving cases like the one you suggested above, but might be hard to implement in gatk-sv given the current scope of the pipeline.
```diff
@@ -401,8 +622,10 @@ protected Genotype collapseSampleGenotypes(final Collection<Genotype> genotypes,
                 .collect(Collectors.groupingBy(Map.Entry::getKey, Collectors.mapping(Map.Entry::getValue, Collectors.toSet())));
         for (final Map.Entry<String, Set<Object>> entry : genotypeFields.entrySet()) {
             if (entry.getKey().equals(GATKSVVCFConstants.COPY_NUMBER_FORMAT)) {
-                collapsedAttributes.put(GATKSVVCFConstants.COPY_NUMBER_FORMAT, collapseSampleCopyNumber(genotypes, expectedCopyNumber));
+                // Handled elsewhere
```
Can you make this comment more explicit?
Changed to:
// Copy number attribute set later using special logic in collapseSampleCopyNumber
// Unit testing is easier with this design
I made an attempt to move collapseSampleCopyNumber() here, but it would have required some substantial changes to the unit tests. Then I also have to pull out CN in the calling function, which requires effectively casting an object to an Integer and thus introduces a bunch of extra code.
```java
}

public List<SVCallRecord> flush() {
    buffer.addAll(clusterEngine.getOutput());
```
This is a pretty unusual object access pattern -- I think it would be more intuitive if the buffer class held a reference to the clusterEngine passed in the constructor.
Done
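A rough sketch of the refactor, with hypothetical names (not the PR's actual classes): the buffer receives the engine at construction and drains it itself, so callers no longer shuttle records between the two objects.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical names throughout; illustrates the ownership pattern only.
public final class OutputBufferSketch<T> {
    public interface Engine<T> {
        List<T> getOutput();
        List<T> forceFlushAndGetOutput();
    }

    private final Engine<T> engine;          // injected once, at construction
    private final Comparator<T> sortOrder;
    private final List<T> buffer = new ArrayList<>();

    public OutputBufferSketch(final Engine<T> engine, final Comparator<T> sortOrder) {
        this.engine = engine;
        this.sortOrder = sortOrder;
    }

    // The buffer pulls from the engine itself instead of being handed records.
    public List<T> flush() {
        buffer.addAll(engine.getOutput());
        buffer.sort(sortOrder);
        final List<T> result = new ArrayList<>(buffer);
        buffer.clear();
        return result;
    }
}
```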
```java
public List<SVCallRecord> forceFlush() {
    buffer.addAll(clusterEngine.forceFlushAndGetOutput());
    final List<SVCallRecord> result = new ArrayList<>(buffer);
```
Why don't these records need to be sorted as in the flush() method?
Good catch, it should be.
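Continuing the hypothetical OutputBufferSketch above, forceFlush() would mirror flush(), including the sort:

```java
    // Same shape as flush(), but drains everything the engine still holds;
    // sorting here keeps the two methods' output ordering consistent.
    public List<T> forceFlush() {
        buffer.addAll(engine.forceFlushAndGetOutput());
        buffer.sort(sortOrder);
        final List<T> result = new ArrayList<>(buffer);
        buffer.clear();
        return result;
    }
```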
```java
} else if (expectedCopyNumber == 0) {
    return Collections.emptyList();
} else if (expectedCopyNumber == 1) {
    // At this point know it's a DUP, and since it's haploid the phasing is unambiguous
```
This seems like the right thing to do given how we are using the output VCF downstream, but it's unfortunate that the allele is named SV_SIMPLE_DUP, which implies a single duplication of the segment. I'm not sure how best to make it less confusing, though.
Added a TODO here noting that we need a generic multi-copy DUP allele for this
```java
} else {
    final String messageAlleles = String.join(", ",
            siteAltAlleles.stream().map(Allele::getDisplayString).collect(Collectors.toList()));
    throw new IllegalArgumentException("Unsupported CNV alt alleles: " + messageAlleles);
```
It might be good to throw in a unit test that hits this (unless you already did and I missed it).
Done, also added one for the exception below
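For reference, the TestNG pattern such a test would likely use (the body here is a placeholder; the real test would invoke the collapser with an unsupported alt allele):

```java
import org.testng.annotations.Test;

public class ExceptionPathSketchTest {
    // The test passes only if the declared exception type is thrown.
    @Test(expectedExceptions = IllegalArgumentException.class)
    public void testUnsupportedCnvAltAllelesThrow() {
        // Placeholder for a collapser call fed an unsupported alt (e.g. <INV>).
        throw new IllegalArgumentException("Unsupported CNV alt alleles: <INV>");
    }
}
```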
```java
}

@DataProvider(name = "collapseSampleGenotypesTestData")
public Object[][] collapseSampleGenotypesTestData() {
```
These tests are very nice. I worry a little that they are somewhat hard to find initially for someone doing a bug fix -- say you wanted to fix a problem in the method CanonicalSVCollapser.getCNVGenotypeAllelesFromCopyNumber() -- you'd have to know to come to SVCollapserTest.collapseSampleGenotypesTestData() to find your test case. This class only seems to be testing the CanonicalSVCollapser -- what about renaming it to CanonicalSVCollapserUnitTest?
Good suggestion (I think I forgot to change this after refactoring)
Thanks @cwhelan, I think I've addressed all comments.
```java
)
private double defragPaddingFraction = CNVLinkage.DEFAULT_PADDING_FRACTION;

@Argument(fullName = DEFRAG_SAMPLE_OVERLAP_LONG_NAME,
```
Yes I struggled with this issue. I'd rather not overload the arguments from SVClusterEngineArgumentsCollection since I think it should have a different default value. One option is to create a separate tool for defragmentation, but that would require extra code. I've modified the docstring to be a little clearer:

"Minimum sample overlap fraction. Use instead of --depth-sample-overlap in CNV defragmentation mode."

and also for padding:

"Padding as a fraction of variant length for CNV defragmentation mode."
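For illustration, roughly how the revised declarations might read as Barclay @Argument annotations (long names and defaults are placeholders; only the doc strings are quoted from the comment above):

```java
import org.broadinstitute.barclay.argparser.Argument;

public final class DefragArgumentsSketch {
    @Argument(fullName = "defrag-padding-fraction",   // placeholder long name
            doc = "Padding as a fraction of variant length for CNV defragmentation mode.",
            optional = true)
    private double defragPaddingFraction = 0.25;      // placeholder default

    @Argument(fullName = "defrag-sample-overlap",     // placeholder long name
            doc = "Minimum sample overlap fraction. Use instead of --depth-sample-overlap in CNV defragmentation mode.",
            optional = true)
    private double defragSampleOverlapFraction = 0.5; // placeholder default
}
```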
```java
        clusterParameterArgs.getDepthParameters(), clusterParameterArgs.getMixedParameters(),
        clusterParameterArgs.getPESRParameters());
} else {
    throw new UnsupportedOperationException("Unsupported algorithm: " + algorithm.name());
```
Done
Thanks for the changes, they look good. I left a couple of comments that relate to the VCF spec and its future. Feel free to respond to them and continue the conversation or just use them as food for thought for the future directions of the pipeline.
Really just trivial changes to the gCNV bits, so those are fine as far as I'm concerned.
Implements tool for clustering SVs, built on top of the clustering engine code refined recently in #7243. In addition to a few bug fixes, updates also include:
- PloidyTable class, which ingests and serves as a simple data class for a tsv of per-sample contig ploidies. This was necessary for inferring genotypes when input vcfs contain non-matching sample and variant records (see the sketch after this list).
- SVClusterEngine updated to render sorted output.
- Improved genotype collapsing (CanonicalSVCollapser), particularly for CNVs. Genotype collapsing now infers allele phasing in certain unambiguous cases, in particular for DUPs and multi-allelic CNVs. Testing for this has been cleaned up and augmented with further cases to validate this functionality.
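As a rough illustration of the PloidyTable concept (only the get(sample, contig) lookup appears in the diff above; the TSV layout and everything else here are assumptions):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; assumes a header row "SAMPLE<TAB>contig1<TAB>..."
// followed by one row of integer ploidies per sample.
public final class PloidyTableSketch {
    private final Map<String, Map<String, Integer>> table = new HashMap<>();

    public PloidyTableSketch(final Path tsv) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(tsv)) {
            final String[] header = reader.readLine().split("\t");
            String line;
            while ((line = reader.readLine()) != null) {
                final String[] tokens = line.split("\t");
                final Map<String, Integer> contigPloidies = new HashMap<>();
                for (int i = 1; i < tokens.length; i++) {
                    contigPloidies.put(header[i], Integer.parseInt(tokens[i]));
                }
                table.put(tokens[0], contigPloidies);
            }
        }
    }

    // Mirrors the lookup used in the diff: ploidyTable.get(sample, contig)
    public int get(final String sample, final String contig) {
        return table.get(sample).get(contig);
    }
}
```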