-
Notifications
You must be signed in to change notification settings - Fork 596
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
(SV) output merged single VCF for new interpretation tool #4996
Conversation
ece04d2
to
0158e61
Compare
Codecov Report
@@ Coverage Diff @@
## master #4996 +/- ##
===============================================
- Coverage 86.385% 86.376% -0.009%
- Complexity 28822 28835 +13
===============================================
Files 1791 1791
Lines 133561 133762 +201
Branches 14902 14919 +17
===============================================
+ Hits 115377 115538 +161
- Misses 12791 12818 +27
- Partials 5393 5406 +13
|
601eff3
to
ccd5f87
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks mostly fine, I have some minor suggestions about how to restructure a little of the code.
@@ -269,11 +269,12 @@ public void validate() { | |||
|
|||
public enum SvEvidenceFilterType {DENSITY, XGBOOST} | |||
|
|||
public static class DiscoverVariantsFromContigsAlignmentsSparkArgumentCollection implements Serializable { | |||
public static class VariantsDiscoveryFromContigsAlignmentsSparkArgumentCollection implements Serializable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why change this name, I thought it read better the old way. Actually, though, both have inconsistent pluralization, it should either be: DiscoverVariantsFromContigAlignmentsSparkArgumentCollection
or VariantDiscoveryFromContigAlignmentsSparkArgumentCollection
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Picked DiscoverVariantsFromContigAlignmentsSparkArgumentCollection
as suggested.
@@ -269,11 +269,12 @@ public void validate() { | |||
|
|||
public enum SvEvidenceFilterType {DENSITY, XGBOOST} | |||
|
|||
public static class DiscoverVariantsFromContigsAlignmentsSparkArgumentCollection implements Serializable { | |||
public static class VariantsDiscoveryFromContigsAlignmentsSparkArgumentCollection implements Serializable { | |||
private static final long serialVersionUID = 1L; | |||
|
|||
public static final int GAPPED_ALIGNMENT_BREAK_DEFAULT_SENSITIVITY = STRUCTURAL_VARIANT_SIZE_LOWER_BOUND; // alignment with gap of size >= 50 will be broken apart. | |||
public static final int CHIMERIC_ALIGNMENTS_HIGHMQ_THRESHOLD = 60; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd add a TODO comment saying that this parameter will be removed when we remove the old tool.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -43,9 +43,11 @@ | |||
import org.broadinstitute.hellbender.utils.read.GATKRead; | |||
import org.broadinstitute.hellbender.utils.read.SAMRecordToGATKReadAdapter; | |||
import scala.Serializable; | |||
import scala.Tuple3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this import isn't used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
import org.broadinstitute.hellbender.tools.spark.sv.utils.SVIntervalTree; | ||
import org.broadinstitute.hellbender.tools.spark.sv.utils.SVUtils; | ||
import org.broadinstitute.hellbender.tools.spark.sv.utils.SVVCFWriter; | ||
import org.broadinstitute.hellbender.utils.Utils; | ||
import org.broadinstitute.hellbender.utils.io.IOUtils; | ||
import org.broadinstitute.hellbender.utils.read.GATKRead; | ||
import scala.Tuple2; | ||
import scala.Tuple3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unused import
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
// TODO: 1/10/18 bring back read annotation, see ticket 4228 | ||
final List<VariantContext> simpleChimeraVariants = | ||
extractSimpleVariants(contigsByPossibleRawTypes.simple, svDiscoveryInputMetaData, outputPrefixWithSampleName); | ||
contigsByPossibleRawTypes.simple.unpersist(false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you just moved around this code for this PR, but looking at this I'm not sure I like this method having the side effect of calling unpersist on the internal RDDs of contigsByPossibleRawTypes
. Just to avoid side effects I'd prefer to pull out the unpersist
calls to the calling method. Actually, since you've got them bundled in one object, why not make an unpersist
method on AssemblyContigsClassifiedByAlignmentSignatures
that unpersists all of the underlying RDDs, and then it's just a one line invocation from the callers of this method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! That's much better.
} | ||
} | ||
|
||
public static final class SVAlnFilter implements StructuralVariantFilter { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Write out SVAlignmentLengthFilter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -154,4 +153,27 @@ public static void writeSAMRecords(final List<GATKRead> reads, final Set<String> | |||
samRecords.sort(localComparator); | |||
SVFileUtils.writeSAMFile( outputPath, samRecords.iterator(), cloneHeader, true); | |||
} | |||
|
|||
/** | |||
* this exist because for whatever reason, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Put in a TODO comment reminding us to get to the bottom of this behavior. If it's something wrong with our data, we should fix it, otherwise we should file a bug against htsjdk.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added todo.
I've talked with Louis and he is aware of the issue (this happens not only to our VCF) since a few months ago. Hopefully it will be resolved.
@@ -113,6 +88,12 @@ private ExtractedSimpleVariants(final List<VariantContext> reInterpretZeroOrOneS | |||
public List<VariantContext> getReInterpretMultiSegmentsCalls() { | |||
return reInterpretMultiSegmentsCalls; | |||
} | |||
|
|||
public List<VariantContext> getMergedReInterpretCalls() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getMergedReinterpretedCalls
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
||
/////////////////////////////////////////////////////////////////////////////////////////////////////////////////// | ||
|
||
public static final Stream<String> expectedAltAlleleHeaderKeysInVCF |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that these should be in test code somewhere, not in main code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree.
I put these here because it's difficult otherwise to remind people to add new annotations to our VCF header. And if we don't, the VCF file is, technically speaking, corrupt.
I tried to use some Java language features to get the fields declared here, and test if the keys are all present in the test VCF files we have, but that turns out to be ugly.
Do you have any suggestions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if we added some assertions to our integration test that checked whether everything is in the header? We could collect all of the alt alleles and info and format annotation keys we see in the variants in StructuralVariationDiscoveryPipelineSparkIntegrationTest.svDiscoveryVCFEquivalenceTest
and assert that they have corresponding header line entries. Would that address your concern? You could split that work off of this PR, too, and make a ticket for it.
} | ||
|
||
// for variants-detected from assembly | ||
// todo: create an alternate assembly file and link to it with breakpoint IDs according to the VCF spec | ||
{ | ||
{// todo: create an alternate assembly file and link to it with breakpoint IDs according to the VCF spec |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we already do this? Do you know what needs to happen to make this spec-compliant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we are already doing this in the sense that the assembly contig names are recorded.
To be spec-compliant, specifically spec version 4.2, section 5.4.2 (what we are doing in this case is NOT violating the spec; we basically offer the convenience of getting the exact inserted sequence without having to parse another file, at the cost of larger file size)
- we currently always pull the inserted sequence (or whatever sequence we need) from those contigs and place it here in the VCF records. As you can imagine sometimes these sequences can be quite long.
- BND variants are going to be reported differently, as the inserted sequence between the novel adjacencies are currently reported explicitly in the VCF records. Again, convenience vs file size.
- the assembly file needs to be revised in case the assembler gives the "reverse strand" representation of the event, and this could be a little bit more complicated since the assembly file, assuming it is going to be an GFA rather than SAM file, has connection information, hence revising it means the connections need to be revised as well (which can be a little tricky).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK! Just checking to see if you wanted to keep this todo comment around or if it could be resolved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cwhelan Thanks for the review! I've incorporated most of the suggested changes except the comment for GATKSVVCFConstants.java.expectedAltAlleleHeaderKeysInVCF
, which I don't know what the best option is yet.
@@ -43,9 +43,11 @@ | |||
import org.broadinstitute.hellbender.utils.read.GATKRead; | |||
import org.broadinstitute.hellbender.utils.read.SAMRecordToGATKReadAdapter; | |||
import scala.Serializable; | |||
import scala.Tuple3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -269,11 +269,12 @@ public void validate() { | |||
|
|||
public enum SvEvidenceFilterType {DENSITY, XGBOOST} | |||
|
|||
public static class DiscoverVariantsFromContigsAlignmentsSparkArgumentCollection implements Serializable { | |||
public static class VariantsDiscoveryFromContigsAlignmentsSparkArgumentCollection implements Serializable { | |||
private static final long serialVersionUID = 1L; | |||
|
|||
public static final int GAPPED_ALIGNMENT_BREAK_DEFAULT_SENSITIVITY = STRUCTURAL_VARIANT_SIZE_LOWER_BOUND; // alignment with gap of size >= 50 will be broken apart. | |||
public static final int CHIMERIC_ALIGNMENTS_HIGHMQ_THRESHOLD = 60; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
import org.broadinstitute.hellbender.tools.spark.sv.utils.SVIntervalTree; | ||
import org.broadinstitute.hellbender.tools.spark.sv.utils.SVUtils; | ||
import org.broadinstitute.hellbender.tools.spark.sv.utils.SVVCFWriter; | ||
import org.broadinstitute.hellbender.utils.Utils; | ||
import org.broadinstitute.hellbender.utils.io.IOUtils; | ||
import org.broadinstitute.hellbender.utils.read.GATKRead; | ||
import scala.Tuple2; | ||
import scala.Tuple3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
} | ||
|
||
// for variants-detected from assembly | ||
// todo: create an alternate assembly file and link to it with breakpoint IDs according to the VCF spec | ||
{ | ||
{// todo: create an alternate assembly file and link to it with breakpoint IDs according to the VCF spec |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we are already doing this in the sense that the assembly contig names are recorded.
To be spec-compliant, specifically spec version 4.2, section 5.4.2 (what we are doing in this case is NOT violating the spec; we basically offer the convenience of getting the exact inserted sequence without having to parse another file, at the cost of larger file size)
- we currently always pull the inserted sequence (or whatever sequence we need) from those contigs and place it here in the VCF records. As you can imagine sometimes these sequences can be quite long.
- BND variants are going to be reported differently, as the inserted sequence between the novel adjacencies are currently reported explicitly in the VCF records. Again, convenience vs file size.
- the assembly file needs to be revised in case the assembler gives the "reverse strand" representation of the event, and this could be a little bit more complicated since the assembly file, assuming it is going to be an GFA rather than SAM file, has connection information, hence revising it means the connections need to be revised as well (which can be a little tricky).
|
||
/////////////////////////////////////////////////////////////////////////////////////////////////////////////////// | ||
|
||
public static final Stream<String> expectedAltAlleleHeaderKeysInVCF |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree.
I put these here because it's difficult otherwise to remind people to add new annotations to our VCF header. And if we don't, the VCF file is, technically speaking, corrupt.
I tried to use some Java language features to get the fields declared here, and test if the keys are all present in the test VCF files we have, but that turns out to be ugly.
Do you have any suggestions?
new SVAlnFilter(svDiscoveryInputMetaData.getDiscoverStageArgs().minAlignLength)); | ||
for (final VariantContext variant : variants) { | ||
String svType = variant.getAttributeAsString(GATKSVVCFConstants.SVTYPE, ""); | ||
if (svType.equals(GATKSVVCFConstants.SYMB_ALT_ALLELE_DEL) || svType.equals(GATKSVVCFConstants.SYMB_ALT_ALLELE_INS) || svType.equals(GATKSVVCFConstants.SYMB_ALT_ALLELE_DUP)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry. This is some mis-communication:
Two types of variants are being treated specially here: BND and INV.
For BND, the record simply doesn't have a SVLEN
now in our implementation. And IMO they should not.
For INV records, in this new interpretation tool, they will have exactly length 0, following the technical definition of SVLEN
, which reads (at the beginning of Section 3, VCF spec version 4.2)
##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles">
In the stable versioned interpretation tool, SVLEN
for inversions are essentially "the number of reference bases inverted + the length of inserted sequence at breakpoints". So this is a breaking change.
BTW, I think the Number
field of SVLEN
is wrong; it should be Number=A
, but spec being the spec, I cannot change it, yet....
final List<VariantContext> inversions = extractInversions();// TODO: 6/29/18 placeholder | ||
|
||
// merged output | ||
final List<VariantContext> merged = new ArrayList<>(20_000); // estimated size |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
variantsWithFilterApplied.add(postHocFilterVariant(variant, filters)); | ||
} | ||
|
||
final String out = outputPrefixWithSampleName + "merged_simple.vcf"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
// TODO: 1/10/18 bring back read annotation, see ticket 4228 | ||
final List<VariantContext> simpleChimeraVariants = | ||
extractSimpleVariants(contigsByPossibleRawTypes.simple, svDiscoveryInputMetaData, outputPrefixWithSampleName); | ||
contigsByPossibleRawTypes.simple.unpersist(false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! That's much better.
@@ -269,11 +269,12 @@ public void validate() { | |||
|
|||
public enum SvEvidenceFilterType {DENSITY, XGBOOST} | |||
|
|||
public static class DiscoverVariantsFromContigsAlignmentsSparkArgumentCollection implements Serializable { | |||
public static class VariantsDiscoveryFromContigsAlignmentsSparkArgumentCollection implements Serializable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Picked DiscoverVariantsFromContigAlignmentsSparkArgumentCollection
as suggested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 This looks good now except for my two minor comments.
45f8106
to
8cdcd26
Compare
* refactor some utilities and GATKSVVCFConstants and GATKSVVCFHeaderLines * group methods in StructuralVariationDiscoveryPipelineSpark by functionality * bring MAX_ALIGN_LENGTH and MAPPING_QUALITIES annotations from CPX variants to re-interpreted simple variants * add new CLI argument and filter assembly based variants based on annotation MAPPING_QUALITIES, MAX_ALIGN_LENGTH * filter out variants of size < 50
8cdcd26
to
ff37d02
Compare
One more step towards using this new tool.
Does:
<INS>
,<DEL>
,<DUP>
,<INV>
calls (there will be more<INV>
calls, but that cannot happen until someone takes a look at PR Inversion breakpoint linking (no merge yet) #4789 and check if the proposed algorithm makes sense)MAPPING_QUALITIES
andMAX_ALIGN_LENGTH
; the default value is chosen after some experimentation using the CHM PacBio as truth and the branchsh-sv-interlvatree-eval.