Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduced some of the repeated steps in ReferenceConfidenceModel.calcNIndelinformativeReads #5469

Merged
merged 8 commits into from
Feb 11, 2019

Conversation

jamesemery
Copy link
Collaborator

@jamesemery jamesemery commented Nov 30, 2018

This micro-optimization fell out of profiling of the HaplotypeCaller in GVCF mode.

Profiler view over an Exome before this patch:
screen shot 2018-11-30 at 2 06 34 pm

Profiler view over the same Exome after this patch:
screen shot 2018-11-30 at 2 20 39 pm

I suspect given the remaining 9% runtime could be reduced further by looking more closely at the array operations in isReadInformativeAboutIndelsOfSize()

(It should be noted that these profiler results lie within the ReferenceModelForNoVariation codepath which since this is over an Exome we expect the runtime to overall be skewed towards no-variation blocks)

Resolves #5648

@jamesemery
Copy link
Collaborator Author

As a sanity check, HaplotypeCaller run in GVCF mode over a bam subsetted to only chr15 on my laptop:
master:

real	4m28.600s
user	5m48.484s
sys	0m4.510s

this branch:

real	3m58.043s
user	5m15.625s
sys	0m4.012s

@droazen droazen self-assigned this Nov 30, 2018
@droazen droazen self-requested a review November 30, 2018 20:10
Copy link
Contributor

@ldgauthier ldgauthier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was going to buy you a coffee for the performance improvements, but after looking at the code it looks like maybe I owe you a coffee for fixing my performance regression.

}

public static byte[] getSequenceAlignedOneToOne(final GATKRead read, final Function<GATKRead, byte[]> bytesProvider, final byte padWith) {
public static Tuple<byte[], byte[]> getBasesAndBaseQualitiesAlginedOneToOne(final GATKRead read, final byte gapCharacter, final byte qualityPadCharacter) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Algined -> Aligned

Copy link
Collaborator

@jonn-smith jonn-smith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One minor question, but looks good. Let me know what you think about pair vs tuple and when tests pass, merge away.

@@ -6,6 +6,7 @@
import htsjdk.samtools.CigarOperator;
import htsjdk.samtools.SAMFileHeader;
import htsjdk.samtools.util.Locatable;
import htsjdk.samtools.util.Tuple;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any advantage to using Tuple over Pair?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, I'm not sure what Pair implementation you are talking about, we have one in gatk that is specific to MarkDuplicates that should probably be renamed to be less confusing anyway...

I think the different tuple implementations are mostly interchangeable but this one happens to live in htsjdk so that is generally a dependency plus.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was talking about org.apache.commons.lang3.tuple.Pair. I've been using it a lot in Funcotator.

@codecov-io
Copy link

codecov-io commented Nov 30, 2018

Codecov Report

Merging #5469 into master will increase coverage by 0.004%.
The diff coverage is 100%.

@@               Coverage Diff               @@
##              master     #5469       +/-   ##
===============================================
+ Coverage     87.037%   87.041%   +0.004%     
- Complexity     31728     31731        +3     
===============================================
  Files           1943      1943               
  Lines         146193    146225       +32     
  Branches       16141     16143        +2     
===============================================
+ Hits          127242    127275       +33     
  Misses         13064     13064               
+ Partials        5887      5886        -1
Impacted Files Coverage Δ Complexity Δ
...kers/haplotypecaller/ReferenceConfidenceModel.java 92.35% <100%> (-0.042%) 69 <0> (ø)
...nstitute/hellbender/utils/read/AlignmentUtils.java 77.901% <100%> (+0.372%) 170 <1> (+1) ⬆️
.../hellbender/utils/read/AlignmentUtilsUnitTest.java 98.609% <100%> (+0.04%) 304 <2> (+2) ⬆️
...nder/utils/runtime/StreamingProcessController.java 67.773% <0%> (+0.474%) 33% <0%> (ø) ⬇️

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the last commit looks good to me

Copy link
Contributor

@droazen droazen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jamesemery Review complete, back to you with my comments.

@@ -200,48 +199,57 @@ public static GATKRead createReadAlignedToRef(final GATKRead originalRead,
return Arrays.copyOfRange(bases, basesStart, basesStop + 1);
}

public static byte[] getBasesAlignedOneToOne(final GATKRead read) {
return getSequenceAlignedOneToOne(read, r -> r.getBasesNoCopy(), GAP_CHARACTER);
public static Tuple<byte[], byte[]> getBasesAndBaseQualitiesAlignedOneToOne(final GATKRead read) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add javadoc for this method, and in particular explain the meaning of the return value. Also document the fact that the bases/base quality arrays are not copied and that changing them will alter the corresponding arrays in the read.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also add good unit tests for the new getBasesAndBaseQualitiesAlignedOneToOne() method.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note, this method is not actually a new method, just a refactoring of an old method that was being invoked twice.

}

public static byte[] getSequenceAlignedOneToOne(final GATKRead read, final Function<GATKRead, byte[]> bytesProvider, final byte padWith) {
public static Tuple<byte[], byte[]> getBasesAndBaseQualitiesAlignedOneToOne(final GATKRead read, final byte gapCharacter, final byte qualityPadCharacter) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add javadoc for this method, and in particular explain the meaning of the return value and method parameters. Also document the fact that the bases/base quality arrays are not copied and that changing them will alter the corresponding arrays in the read.

@@ -590,21 +591,20 @@ boolean isReadInformativeAboutIndelsOfSize(final GATKRead read,
// We are safe to use the faster no-copy versions of getBases and getBaseQualities here,
// since we're not modifying the returned arrays in any way. This makes a small difference
// in the HaplotypeCaller profile, since this method is a major hotspot.
final byte[] readBases = AlignmentUtils.getBasesAlignedOneToOne(read); //calls getBasesNoCopy if CIGAR is all match
final byte[] readQuals = AlignmentUtils.getBaseQualsAlignedOneToOne(read);
final Tuple<byte[], byte[]> readBasesAndBaseQualities = AlignmentUtils.getBasesAndBaseQualitiesAlignedOneToOne(read); //calls getBasesNoCopy if CIGAR is all match
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with @jonn-smith on using the Apache commons Pair instead of Tuple for the return value.

}
else {
final byte[] paddedBases = new byte[CigarUtils.countRefBasesIncludingSoftClips(read, 0, cigar.numCigarElements())];
int numberRefBasesIncludingSoftclips = CigarUtils.countRefBasesIncludingSoftClips(read, 0, numCigarElements);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

numberRefBasesIncludingSoftclips should be final

int literalPos = 0;
int paddedPos = 0;
for ( int i = 0; i < cigar.numCigarElements(); i++ ) {
final CigarElement ce = cigar.getCigarElement(i);
for ( int i = 0; i < read.numCigarElements(); i++ ) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use the already-initialized numCigarElements here instead of read.numCigarElements()

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch!

@droazen droazen assigned jamesemery and unassigned droazen and jonn-smith Dec 10, 2018
@droazen
Copy link
Contributor

droazen commented Jan 16, 2019

@jamesemery Pinging you on this one -- do you want to get this in for the 4.1 release?

@jamesemery jamesemery assigned droazen and unassigned jamesemery Jan 29, 2019
@jamesemery
Copy link
Collaborator Author

@droazen responded to comments back to you

Copy link
Contributor

@droazen droazen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Back to you @jamesemery with a second round of comments



final int baselineMMSum = sumMismatchingQualities(readBases, readQuals, readStart, refBases, refStart, Integer.MAX_VALUE);
final int baselineMMSum = sumMismatchingQualities(readBasesAndBaseQualities.getKey(), readBasesAndBaseQualities.getValue(), readStart, refBases, refStart, Integer.MAX_VALUE);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use getLeft() and getRight() here instead of getKey() and getValue()

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


// consider each indel size up to max in term, checking if an indel that deletes either the ref bases (deletion
// or read bases (insertion) would fit as well as the origin baseline sum of mismatching quality scores
for ( int indelSize = 1; indelSize <= maxIndelSize; indelSize++ ) {
// check insertions:
if (sumMismatchingQualities(readBases, readQuals, readStart + indelSize, refBases, refStart, baselineMMSum) <= baselineMMSum) {
if (sumMismatchingQualities(readBasesAndBaseQualities.getKey(), readBasesAndBaseQualities.getValue(), readStart + indelSize, refBases, refStart, baselineMMSum) <= baselineMMSum) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use getLeft() and getRight() here instead of getKey() and getValue()

return false;
}
// check deletions:
if (sumMismatchingQualities(readBases, readQuals, readStart, refBases, refStart + indelSize, baselineMMSum) <= baselineMMSum) {
if (sumMismatchingQualities(readBasesAndBaseQualities.getKey(), readBasesAndBaseQualities.getValue(), readStart, refBases, refStart + indelSize, baselineMMSum) <= baselineMMSum) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use getLeft() and getRight() here instead of getKey() and getValue()

return getSequenceAlignedOneToOne(read, r -> r.getBaseQualitiesNoCopy(), (byte)0);
/**
* Returns the "IGV View" of all the bases and base qualities in a read aligned to the reference according to the cigar, dropping any bases
* that might be in the read but don't aren't in the reference. Any bases that appear in the reference but not the read
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't aren't -> aren't

}

public static byte[] getSequenceAlignedOneToOne(final GATKRead read, final Function<GATKRead, byte[]> bytesProvider, final byte padWith) {
private static Pair<byte[], byte[]> getBasesAndBaseQualitiesAlignedOneToOne(final GATKRead read, final byte gapCharacter, final byte qualityPadCharacter) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that basePadCharacter would be a better name than gapCharacter, and more consistent with the qualityPadCharacter arg.

}
else {
final byte[] paddedBases = new byte[CigarUtils.countRefBasesIncludingSoftClips(read, 0, cigar.numCigarElements())];
int numberRefBasesIncludingSoftclips = CigarUtils.countRefBasesIncludingSoftClips(read, 0, numCigarElements);
final byte[] paddedBases = new byte[numberRefBasesIncludingSoftclips];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

numberRefBasesIncludingSoftclips should be final (looks like you missed this comment from last time?)

for ( int i = 0; i < cigar.numCigarElements(); i++ ) {
final CigarElement ce = cigar.getCigarElement(i);
for ( int i = 0; i < read.numCigarElements(); i++ ) {
final CigarElement ce = read.getCigarElement(i);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use the already-initialized numCigarElements in the for loop condition here instead of read.numCigarElements() (another comment not addressed from last time)



@Test(dataProvider = "makeGetBasesAndBaseQualitiesAlignedOneToOneTest")
public void testCalcNIndelInformativeReads(final String readBases, final String cigar, final String expectedBases, final byte[] expectedQuals ) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test case is misnamed -- should be testGetBasesAndBaseQualitiesAlignedOneToOne(), not testCalcNIndelInformativeReads()

Pair<byte[], byte[]> actual = AlignmentUtils.getBasesAndBaseQualitiesAlignedOneToOne(read);

Assert.assertEquals(new String(actual.getKey()), expectedBases);
Assert.assertEquals(actual.getValue(), expectedQuals);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use getLeft() and getRight()here as well, instead ofgetKey()/getValue()`

*
* @param read a read to return aligned to the reference
* @return A tuple of byte arrays where the first array corresponds to the bases aligned to the reference and second
* array corresponds to the baseQualities aligned to the reference.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Description of return value is out of date -- it returns a Pair, not a Tuple.

@droazen droazen assigned jamesemery and unassigned droazen Feb 7, 2019
@jamesemery
Copy link
Collaborator Author

@droazen Responded to your comments, thanks for the review. Is this okay to merge now?

@jamesemery jamesemery force-pushed the je_HaplotypeCallerMicroOptimization1 branch from 33c1ce5 to 0de1938 Compare February 7, 2019 20:15
Copy link
Contributor

@droazen droazen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 latest version looks good -- merge once tests pass

@jamesemery jamesemery merged commit 566e97c into master Feb 11, 2019
@jamesemery jamesemery deleted the je_HaplotypeCallerMicroOptimization1 branch February 11, 2019 17:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

AlignmentUtils.getSequenceAlignedOneToOne needs javadoc
6 participants