Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

choose one disjoint read to report #5926

Merged
merged 1 commit into from
May 20, 2019
Merged

Conversation

tedsharpe
Copy link
Contributor

write BAM of rejected reads
trim reads to length of short fragments

@tedsharpe tedsharpe requested a review from SHuang-Broad May 7, 2019 20:38
@codecov
Copy link

codecov bot commented May 7, 2019

Codecov Report

Merging #5926 into master will decrease coverage by 0.003%.
The diff coverage is 37.975%.

@@              Coverage Diff               @@
##              master    #5926       +/-   ##
==============================================
- Coverage     86.823%   86.82%   -0.003%     
- Complexity     32344    32348        +4     
==============================================
  Files           1993     1993               
  Lines         149460   149474       +14     
  Branches       16502    16521       +19     
==============================================
+ Hits          129766   129774        +8     
+ Misses         13679    13676        -3     
- Partials        6015     6024        +9
Impacted Files Coverage Δ Complexity Δ
...er/tools/AnalyzeSaturationMutagenesisUnitTest.java 97.878% <100%> (+0.029%) 29 <0> (ø) ⬇️
...hellbender/tools/AnalyzeSaturationMutagenesis.java 46.945% <30.986%> (-0.023%) 15 <5> (+4)
...lotypecaller/readthreading/ReadThreadingGraph.java 88.725% <0%> (-0.245%) 159% <0%> (ø)

Copy link
Contributor

@SHuang-Broad SHuang-Broad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Except several nonessential SE related comments, I have

  • a question, "I'm not understanding why disjoint => inconsistent pair." on line 1723
  • a suggestion regarding lines 1611-1627.

private static final MoleculeCounts moleculeCountsUnpaired = new MoleculeCounts();
private static final MoleculeCounts moleculeCountsDisjointPair = new MoleculeCounts();
private static final MoleculeCounts moleculeCountsOverlappingPair = new MoleculeCounts();
private static final String[] unpairedLabels = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like a candidate for enum but not critical.
And if you are bored, a class hierarchy for MoleculeCounts is possible I agree it feels like an overkill for now, though I do feel the last several revisions are going that direction.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't see how an enum would help. Class hierarchy would work, but I'm going to duck it for now since I'm just replacing labels.

} else { // mates are disjoint
final String ignoredMate = "ignoredMate";
if ( read1.isFirstOfPair() ) {
processReport(read1, report1, moleculeCountsDisjointPair);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we want to loose weight:

final GATKRead rd = read1.isFirstOfPair() ? read1 : read2;
final ReadReport rp = read1.isFirstOfPair() ? report1 : report2;
processReport(rd, rp, moleculeCountsDisjointPair);
if ( bamWriter != null ) {
    rd.setAttribute(EXCUSE_ATTRIBUTE, ignoredMate);
    bamWriter.addRead(rd);
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're processing the report on the selected read, but setting the attribute on the rejected read. This doesn't quite work.

bamWriter.addRead(read1);
}
}
moleculeCountsDisjointPair.bumpInconsistentPairs();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not understanding why disjoint => inconsistent pair.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's because it's an ugly hack. Tried to improve it a little.

}

// don't process past end of fragment when fragment length < read length
if ( read.isProperlyPaired() ) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since Interval trim is being changed in this new if-block, I would propose absorbing this block into calculateTrim(...).
I tried to use "Refactor -> Extract -> Method", and the IDE tells me that it is impossible only because of the two statements return documentLowQualityRead(read), but I think it is still do-able by returning a 0-sized interval in calculateTrim(...), just like what happened in lines 1606-1608.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored along the lines of what you suggested into a method that does the quality trim (the original calculateTrim code), and a 2nd that does the short fragment trim.

Copy link
Contributor Author

@tedsharpe tedsharpe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, you know what? Your valid criticism of the MoleculeCounts organization has left me feeling dirty.
It's too much of a hack. I'll rework it and let you know when I'm ready for another look.

private static final MoleculeCounts moleculeCountsUnpaired = new MoleculeCounts();
private static final MoleculeCounts moleculeCountsDisjointPair = new MoleculeCounts();
private static final MoleculeCounts moleculeCountsOverlappingPair = new MoleculeCounts();
private static final String[] unpairedLabels = {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't see how an enum would help. Class hierarchy would work, but I'm going to duck it for now since I'm just replacing labels.

}

// don't process past end of fragment when fragment length < read length
if ( read.isProperlyPaired() ) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored along the lines of what you suggested into a method that does the quality trim (the original calculateTrim code), and a 2nd that does the short fragment trim.

} else { // mates are disjoint
final String ignoredMate = "ignoredMate";
if ( read1.isFirstOfPair() ) {
processReport(read1, report1, moleculeCountsDisjointPair);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're processing the report on the selected read, but setting the attribute on the rejected read. This doesn't quite work.

bamWriter.addRead(read1);
}
}
moleculeCountsDisjointPair.bumpInconsistentPairs();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's because it's an ugly hack. Tried to improve it a little.

Copy link
Contributor Author

@tedsharpe tedsharpe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SHuang-Broad Steve, this is ready for another look. I created an enum and redid the counters. I think it feels a little less hacky now.

Copy link
Contributor

@SHuang-Broad SHuang-Broad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Several trivial comments, except one that is out of curiosity, regarding line 1640

just curious, is it ever possible, that both coverages are empty?

If you feel this question needs to be addressed in code, I'm happy to take a look again.

}
}
} catch ( final Exception exception ) {
throw new GATKException(
"Caught unexpected exception on read " + readCounts.totalCounts() + ": " + read1.getName(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you want to close the bam writer here? not sure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. Better to write nothing than to write an incomplete file.

@@ -528,31 +540,18 @@ private static void writeReadCounts() {
}
}

private static void writeMoleculeCounts( final MoleculeCounts moleculeCounts,
private static void writeMoleculeCounts( final ReportTypeCounts counts,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this method name is a relic from the previous iterations.
But non-essential.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@VisibleForTesting static Reference reference;
@VisibleForTesting static CodonTracker codonTracker; // device for turning SNVs into CodonVariations
@VisibleForTesting static SAMFileGATKReadWriter bamWriter;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about naming it something like rejectedReadsBamWriter, for clarity on purpose?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -1614,21 +1610,81 @@ public void updateCounts( final MoleculeCounts moleculeCounts,
return new Interval(readStart, readEnd);
}

@VisibleForTesting static void updateCountsForPair( final ReadReport report1, final ReadReport report2 ) {
private static Interval calculateShortFragmentTrim( final GATKRead read, final Interval qualityTrim ) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest give a short explanation on the difference between the two calculate...Trim functions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doc for trim methods added

} else if ( report2.getRefCoverage().isEmpty() ) {
report1.updateCounts(moleculeCountsDisjointPair, codonTracker, variationCounts, reference);
if ( !report1.getRefCoverage().isEmpty() ) {
Copy link
Contributor

@SHuang-Broad SHuang-Broad May 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IDE tells me report1.getRefCoverage().isEmpty() is always false, which seem to be right given that the enclosing block is triggered only when report1.getRefCoverage().isEmpty() is false.

But I can understand if it is for symmetry.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ha. Just silliness. I removed the redundant test.

if ( report1.getRefCoverage().isEmpty() ) {
report2.updateCounts(moleculeCountsDisjointPair, codonTracker, variationCounts, reference);
if ( !report2.getRefCoverage().isEmpty() ) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just curious, is it ever possible, that both coverages are empty?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is possible that both coverages will be empty. The trimming might pare away the entire alignment for both reads.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then is the case covered somewhere up in the logic chain? I cannot remember if it is.
And I don't think this case is covered in these if-else block, or am I missing something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if ( report1 is empty ) {
    if ( report2 is not empty ) {
        process it
    } /* else {
        do nothing
    } */

It's covered.

private static long totalBaseCalls;
private static final ReportTypeCounts readCounts = new ReportTypeCounts();
private static final ReportTypeCounts unpairedCounts = new ReportTypeCounts();
private static final ReportTypeCounts disjointPairCounts = new ReportTypeCounts();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is disjoint the same as non-overlapping, in the strict technical sense (i.e. hard clipping not included, soft clipping included)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unclear on what you're asking: We're only dealing with primary lines (no hard clips). But we're also doing our own quality trimming and short-fragment trimming, so it's not just the cigar that dictates whether the evaluated bases in a pair overlap, abut, or are disjoint.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand now what disjoint means. Thanks!

Copy link
Contributor Author

@tedsharpe tedsharpe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review, Steve. I think the code is quite a bit clearer now.

@VisibleForTesting static Reference reference;
@VisibleForTesting static CodonTracker codonTracker; // device for turning SNVs into CodonVariations
@VisibleForTesting static SAMFileGATKReadWriter bamWriter;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

private static long totalBaseCalls;
private static final ReportTypeCounts readCounts = new ReportTypeCounts();
private static final ReportTypeCounts unpairedCounts = new ReportTypeCounts();
private static final ReportTypeCounts disjointPairCounts = new ReportTypeCounts();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unclear on what you're asking: We're only dealing with primary lines (no hard clips). But we're also doing our own quality trimming and short-fragment trimming, so it's not just the cigar that dictates whether the evaluated bases in a pair overlap, abut, or are disjoint.

}
}
} catch ( final Exception exception ) {
throw new GATKException(
"Caught unexpected exception on read " + readCounts.totalCounts() + ": " + read1.getName(),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. Better to write nothing than to write an incomplete file.

@@ -528,31 +540,18 @@ private static void writeReadCounts() {
}
}

private static void writeMoleculeCounts( final MoleculeCounts moleculeCounts,
private static void writeMoleculeCounts( final ReportTypeCounts counts,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -1614,21 +1610,81 @@ public void updateCounts( final MoleculeCounts moleculeCounts,
return new Interval(readStart, readEnd);
}

@VisibleForTesting static void updateCountsForPair( final ReadReport report1, final ReadReport report2 ) {
private static Interval calculateShortFragmentTrim( final GATKRead read, final Interval qualityTrim ) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doc for trim methods added

if ( report1.getRefCoverage().isEmpty() ) {
report2.updateCounts(moleculeCountsDisjointPair, codonTracker, variationCounts, reference);
if ( !report2.getRefCoverage().isEmpty() ) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is possible that both coverages will be empty. The trimming might pare away the entire alignment for both reads.

} else if ( report2.getRefCoverage().isEmpty() ) {
report1.updateCounts(moleculeCountsDisjointPair, codonTracker, variationCounts, reference);
if ( !report1.getRefCoverage().isEmpty() ) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ha. Just silliness. I removed the redundant test.

@tedsharpe tedsharpe force-pushed the tws_choose_one_disjoint_report branch from 3e4efb7 to 07a03a2 Compare May 19, 2019 18:17
@SHuang-Broad
Copy link
Contributor

Thanks for making the tool better!

@tedsharpe tedsharpe merged commit cfc744b into master May 20, 2019
@tedsharpe tedsharpe deleted the tws_choose_one_disjoint_report branch May 20, 2019 15:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants