-
Notifications
You must be signed in to change notification settings - Fork 594
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduced some of the repeated steps in ReferenceConfidenceModel.calcNIndelinformativeReads #5469
Conversation
As a sanity check, HaplotypeCaller run in GVCF mode over a bam subsetted to only chr15 on my laptop:
this branch:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was going to buy you a coffee for the performance improvements, but after looking at the code it looks like maybe I owe you a coffee for fixing my performance regression.
} | ||
|
||
public static byte[] getSequenceAlignedOneToOne(final GATKRead read, final Function<GATKRead, byte[]> bytesProvider, final byte padWith) { | ||
public static Tuple<byte[], byte[]> getBasesAndBaseQualitiesAlginedOneToOne(final GATKRead read, final byte gapCharacter, final byte qualityPadCharacter) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Algined -> Aligned
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One minor question, but looks good. Let me know what you think about pair
vs tuple
and when tests pass, merge away.
@@ -6,6 +6,7 @@ | |||
import htsjdk.samtools.CigarOperator; | |||
import htsjdk.samtools.SAMFileHeader; | |||
import htsjdk.samtools.util.Locatable; | |||
import htsjdk.samtools.util.Tuple; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any advantage to using Tuple
over Pair
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huh, I'm not sure what Pair implementation you are talking about, we have one in gatk that is specific to MarkDuplicates that should probably be renamed to be less confusing anyway...
I think the different tuple implementations are mostly interchangeable but this one happens to live in htsjdk so that is generally a dependency plus.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was talking about org.apache.commons.lang3.tuple.Pair
. I've been using it a lot in Funcotator.
Codecov Report
@@ Coverage Diff @@
## master #5469 +/- ##
===============================================
+ Coverage 87.037% 87.041% +0.004%
- Complexity 31728 31731 +3
===============================================
Files 1943 1943
Lines 146193 146225 +32
Branches 16141 16143 +2
===============================================
+ Hits 127242 127275 +33
Misses 13064 13064
+ Partials 5887 5886 -1
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the last commit looks good to me
61413bc
to
e26fb36
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jamesemery Review complete, back to you with my comments.
@@ -200,48 +199,57 @@ public static GATKRead createReadAlignedToRef(final GATKRead originalRead, | |||
return Arrays.copyOfRange(bases, basesStart, basesStop + 1); | |||
} | |||
|
|||
public static byte[] getBasesAlignedOneToOne(final GATKRead read) { | |||
return getSequenceAlignedOneToOne(read, r -> r.getBasesNoCopy(), GAP_CHARACTER); | |||
public static Tuple<byte[], byte[]> getBasesAndBaseQualitiesAlignedOneToOne(final GATKRead read) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add javadoc for this method, and in particular explain the meaning of the return value. Also document the fact that the bases/base quality arrays are not copied and that changing them will alter the corresponding arrays in the read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also add good unit tests for the new getBasesAndBaseQualitiesAlignedOneToOne()
method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note, this method is not actually a new method, just a refactoring of an old method that was being invoked twice.
} | ||
|
||
public static byte[] getSequenceAlignedOneToOne(final GATKRead read, final Function<GATKRead, byte[]> bytesProvider, final byte padWith) { | ||
public static Tuple<byte[], byte[]> getBasesAndBaseQualitiesAlignedOneToOne(final GATKRead read, final byte gapCharacter, final byte qualityPadCharacter) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add javadoc for this method, and in particular explain the meaning of the return value and method parameters. Also document the fact that the bases/base quality arrays are not copied and that changing them will alter the corresponding arrays in the read.
@@ -590,21 +591,20 @@ boolean isReadInformativeAboutIndelsOfSize(final GATKRead read, | |||
// We are safe to use the faster no-copy versions of getBases and getBaseQualities here, | |||
// since we're not modifying the returned arrays in any way. This makes a small difference | |||
// in the HaplotypeCaller profile, since this method is a major hotspot. | |||
final byte[] readBases = AlignmentUtils.getBasesAlignedOneToOne(read); //calls getBasesNoCopy if CIGAR is all match | |||
final byte[] readQuals = AlignmentUtils.getBaseQualsAlignedOneToOne(read); | |||
final Tuple<byte[], byte[]> readBasesAndBaseQualities = AlignmentUtils.getBasesAndBaseQualitiesAlignedOneToOne(read); //calls getBasesNoCopy if CIGAR is all match |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with @jonn-smith on using the Apache commons Pair
instead of Tuple
for the return value.
} | ||
else { | ||
final byte[] paddedBases = new byte[CigarUtils.countRefBasesIncludingSoftClips(read, 0, cigar.numCigarElements())]; | ||
int numberRefBasesIncludingSoftclips = CigarUtils.countRefBasesIncludingSoftClips(read, 0, numCigarElements); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
numberRefBasesIncludingSoftclips
should be final
int literalPos = 0; | ||
int paddedPos = 0; | ||
for ( int i = 0; i < cigar.numCigarElements(); i++ ) { | ||
final CigarElement ce = cigar.getCigarElement(i); | ||
for ( int i = 0; i < read.numCigarElements(); i++ ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use the already-initialized numCigarElements
here instead of read.numCigarElements()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch!
@jamesemery Pinging you on this one -- do you want to get this in for the 4.1 release? |
@droazen responded to comments back to you |
9b6fed1
to
37cf03e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Back to you @jamesemery with a second round of comments
|
||
|
||
final int baselineMMSum = sumMismatchingQualities(readBases, readQuals, readStart, refBases, refStart, Integer.MAX_VALUE); | ||
final int baselineMMSum = sumMismatchingQualities(readBasesAndBaseQualities.getKey(), readBasesAndBaseQualities.getValue(), readStart, refBases, refStart, Integer.MAX_VALUE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use getLeft()
and getRight()
here instead of getKey()
and getValue()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
||
// consider each indel size up to max in term, checking if an indel that deletes either the ref bases (deletion | ||
// or read bases (insertion) would fit as well as the origin baseline sum of mismatching quality scores | ||
for ( int indelSize = 1; indelSize <= maxIndelSize; indelSize++ ) { | ||
// check insertions: | ||
if (sumMismatchingQualities(readBases, readQuals, readStart + indelSize, refBases, refStart, baselineMMSum) <= baselineMMSum) { | ||
if (sumMismatchingQualities(readBasesAndBaseQualities.getKey(), readBasesAndBaseQualities.getValue(), readStart + indelSize, refBases, refStart, baselineMMSum) <= baselineMMSum) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use getLeft()
and getRight()
here instead of getKey()
and getValue()
return false; | ||
} | ||
// check deletions: | ||
if (sumMismatchingQualities(readBases, readQuals, readStart, refBases, refStart + indelSize, baselineMMSum) <= baselineMMSum) { | ||
if (sumMismatchingQualities(readBasesAndBaseQualities.getKey(), readBasesAndBaseQualities.getValue(), readStart, refBases, refStart + indelSize, baselineMMSum) <= baselineMMSum) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use getLeft()
and getRight()
here instead of getKey()
and getValue()
return getSequenceAlignedOneToOne(read, r -> r.getBaseQualitiesNoCopy(), (byte)0); | ||
/** | ||
* Returns the "IGV View" of all the bases and base qualities in a read aligned to the reference according to the cigar, dropping any bases | ||
* that might be in the read but don't aren't in the reference. Any bases that appear in the reference but not the read |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't aren't
-> aren't
} | ||
|
||
public static byte[] getSequenceAlignedOneToOne(final GATKRead read, final Function<GATKRead, byte[]> bytesProvider, final byte padWith) { | ||
private static Pair<byte[], byte[]> getBasesAndBaseQualitiesAlignedOneToOne(final GATKRead read, final byte gapCharacter, final byte qualityPadCharacter) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that basePadCharacter
would be a better name than gapCharacter
, and more consistent with the qualityPadCharacter
arg.
} | ||
else { | ||
final byte[] paddedBases = new byte[CigarUtils.countRefBasesIncludingSoftClips(read, 0, cigar.numCigarElements())]; | ||
int numberRefBasesIncludingSoftclips = CigarUtils.countRefBasesIncludingSoftClips(read, 0, numCigarElements); | ||
final byte[] paddedBases = new byte[numberRefBasesIncludingSoftclips]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
numberRefBasesIncludingSoftclips
should be final (looks like you missed this comment from last time?)
for ( int i = 0; i < cigar.numCigarElements(); i++ ) { | ||
final CigarElement ce = cigar.getCigarElement(i); | ||
for ( int i = 0; i < read.numCigarElements(); i++ ) { | ||
final CigarElement ce = read.getCigarElement(i); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use the already-initialized numCigarElements
in the for loop condition here instead of read.numCigarElements()
(another comment not addressed from last time)
|
||
|
||
@Test(dataProvider = "makeGetBasesAndBaseQualitiesAlignedOneToOneTest") | ||
public void testCalcNIndelInformativeReads(final String readBases, final String cigar, final String expectedBases, final byte[] expectedQuals ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test case is misnamed -- should be testGetBasesAndBaseQualitiesAlignedOneToOne()
, not testCalcNIndelInformativeReads()
Pair<byte[], byte[]> actual = AlignmentUtils.getBasesAndBaseQualitiesAlignedOneToOne(read); | ||
|
||
Assert.assertEquals(new String(actual.getKey()), expectedBases); | ||
Assert.assertEquals(actual.getValue(), expectedQuals); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use getLeft() and
getRight()here as well, instead of
getKey()/
getValue()`
* | ||
* @param read a read to return aligned to the reference | ||
* @return A tuple of byte arrays where the first array corresponds to the bases aligned to the reference and second | ||
* array corresponds to the baseQualities aligned to the reference. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Description of return value is out of date -- it returns a Pair, not a Tuple.
@droazen Responded to your comments, thanks for the review. Is this okay to merge now? |
33c1ce5
to
0de1938
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 latest version looks good -- merge once tests pass
This micro-optimization fell out of profiling of the HaplotypeCaller in GVCF mode.
Profiler view over an Exome before this patch:
Profiler view over the same Exome after this patch:
I suspect given the remaining 9% runtime could be reduced further by looking more closely at the array operations in
isReadInformativeAboutIndelsOfSize()
(It should be noted that these profiler results lie within the ReferenceModelForNoVariation codepath which since this is over an Exome we expect the runtime to overall be skewed towards no-variation blocks)
Resolves #5648