Bg pileup detection #7432

Merged: 5 commits merged into master from bg_pileup_detection on Apr 14, 2022

Conversation

bhanugandham (Contributor):

This is the beta version of the pileup-based haplotype generation and variant detection code.

@bhanugandham (Contributor, Author):

@jamesemery As promised, here is the draft PR of the pileup code that works for M2 and HC.

@jamesemery (Collaborator) left a comment:

I have given this code a first pass. Overall it looks good, but there are some things that need to be cleaned up: leftover comments and little things to deal with. My biggest comment is that there are a few methods that really need some unit tests covering their behavior, and I would be happy to talk over how to go about writing those. Given that there is a lot of interest in this code, I think it will pay off to invest the time early and write tests for some of the tricky methods you added (namely the filtering of the discovered haplotypes and some of the code you are using to add haplotypes).

I also think there should be more parameters exposed for this code in case other projects in the future want to try custom configurations of the pileup detection that might be more appropriate for their use case.

@@ -0,0 +1,3 @@
Manifest-Version: 1.0
Main-Class: org.broadinstitute.hellbender.Main
Collaborator:

This looks like an accidental addition.

Contributor:

will remove

// @Test
// public void testAddGivenAlleles() {
// final int assemblyRegionStart = 1;
// final int maxMnpDistance = 0;
Collaborator:

Reenable these tests

Contributor:

👍

@@ -114,6 +123,13 @@ public AssemblyRegion(final SimpleInterval activeSpan, final int padding, final
this(activeSpan, true, padding, header);
}

public List<AlignmentData> getAlignmentData() {
Collaborator:

Comment these
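For illustration, the kind of comment being requested might look roughly like this (the wording and the backing field name are assumptions, not the merged code):

/**
 * Returns the per-read alignment data collected for this assembly region, which the
 * pileup-based detection code uses to supplement the assembled haplotypes.
 */
public List<AlignmentData> getAlignmentData() {
    return alignmentData;
}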

addToReadCollection(read, reads);
}

private void addToReadCollection(final GATKRead read, final List<GATKRead> collection) {
Collaborator:

"addToReadCollection" -> "addToReadCollectionAndValidate"

@@ -113,12 +117,21 @@ public AssemblyRegion next() {
return toReturn;
}

// private void addAlignmentData(AlignmentData alignmentData){
Collaborator:

Commented-out code.

}

// now look for indels
if (element.isBeforeInsertion()) {
Collaborator:

There should be an argument in the read threading assembler to enable/disable the indel checks at this stage.

Contributor:

@jamesemery It's unclear what you mean here.

Collaborator:

There are some other projects that are interested in this code. They are not necessarily interested in the pileup indel calling code right now (since it's complicated and they might have their own versions of it). Ideally there should be a toggle that disables all of the indel pileup code (or that disables putting the indels into the haplotypes).
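For illustration, the kind of toggle being suggested might look roughly like the following sketch (the class, flag name, and default are hypothetical, not the merged code):

import org.broadinstitute.barclay.argparser.Advanced;
import org.broadinstitute.barclay.argparser.Argument;

public class PileupDetectionArgsSketch {
    // Hypothetical toggle: when false, indel evidence from the pileups is ignored
    // and only pileup SNPs are added to the haplotypes.
    @Advanced
    @Argument(fullName = "pileup-detection-enable-indels",
            doc = "Pileup Detection: if false, indel evidence from the pileups is not added to the haplotypes",
            optional = true)
    public boolean detectIndels = false;
}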

// TODO: AH & BG add an option to deal with multiple alt alleles
Optional<Map.Entry<Byte, Integer>> maxAlt = altCounts.entrySet().stream().max(Comparator.comparingInt(Map.Entry::getValue));
if (maxAlt.isPresent() && ((float)maxAlt.get().getValue() / (float)numOfBases) > 0.10 && numOfBases >= 5 ) {
// if (maxAlt.isPresent() && ((float)maxAlt.get().getValue() / (float)numOfBases) > 0.1 && maxAlt.get().getValue() > 10 ) {
Collaborator:

More commented-out code.

alleles.add(Allele.create(referenceContext.getBase(), true));
// TODO: AH & BG add an option to deal with multiple alt alleles
Optional<Map.Entry<Byte, Integer>> maxAlt = altCounts.entrySet().stream().max(Comparator.comparingInt(Map.Entry::getValue));
if (maxAlt.isPresent() && ((float)maxAlt.get().getValue() / (float)numOfBases) > 0.10 && numOfBases >= 5 ) {
Collaborator:

I would pull this out as an explicit method filterAlleles() or something along those lines. I would also make static variables for these magic numbers.

Collaborator:

I also think these should ultimately be exposed for configuration down the road... There are a few ways to think about doing it, but we should probably discuss how that works here.
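For illustration, the refactor being suggested might look roughly like this sketch (the class and method names are hypothetical; the thresholds are copied from the snippet above, not from the merged code):

public final class PileupAlleleFilterSketch {
    // Magic numbers from the snippet above, promoted to named constants.
    private static final double SNP_ALT_FRACTION_THRESHOLD = 0.10;
    private static final int MIN_PILEUP_DEPTH = 5;

    /** Returns true if the most common alt base has enough pileup support to be emitted. */
    public static boolean passesSnpFilter(final int maxAltCount, final int numOfBases) {
        return numOfBases >= MIN_PILEUP_DEPTH
                && (double) maxAltCount / numOfBases > SNP_ALT_FRACTION_THRESHOLD;
    }
}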

indelAlleles.add(Allele.create(referenceContext.getBase(), true));
Optional<Map.Entry<String, Integer>> maxIns = insertionCounts.entrySet().stream().max(Comparator.comparingInt(Map.Entry::getValue));
if (maxIns.isPresent() && ((float)maxIns.get().getValue() / (float)numOfBases) > 0.50 && numOfBases >= 5 ) {
indelAlleles.add(Allele.create(String.format("%c", referenceContext.getBase()) + maxIns.get().getKey()));
Collaborator:

Drop the String.format() call here, since this might get triggered a lot.
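For illustration, the changed line might look roughly like this (a sketch of the suggestion, not the merged code):

// Plain concatenation avoids the String.format() overhead on this hot path.
indelAlleles.add(Allele.create((char) referenceContext.getBase() + maxIns.get().getKey()));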

@@ -86,6 +88,8 @@ public AssemblyRegionIterator(final MultiIntervalShard<GATKRead> readShard,
this.readCachingIterator = new ReadCachingIterator(readShard.iterator());
this.readCache = new ArrayDeque<>();
this.activityProfile = new BandPassActivityProfile(assemblyRegionArgs.maxProbPropagationDistance, assemblyRegionArgs.activeProbThreshold, BandPassActivityProfile.MAX_FILTER_SIZE, BandPassActivityProfile.DEFAULT_SIGMA, readHeader);
// TODO: AH & BG handle changing contig
Contributor:

@ahaessly can you elaborate what was missing WRT this TODO?

Contributor:

Perhaps a simple test case that would cover the circumstance you were thinking of?

Contributor:

This was something that James warned us about. I think the shards can come from different contigs, so we needed to add code that checks for that case.

@yfarjoun force-pushed the bg_pileup_detection branch from 58e59ea to 23f29c8 on November 29, 2021 02:25
* Percentage of reads required to support the alt for a variant to be considered
*/
@Advanced
@Argument(fullName= PILEUP_DETECTION_SNP_THRESHOLD, doc = "Percentage of alt supporting reads in order to consider alt SNP", optional = true)
@yfarjoun (Contributor) commented on Jan 9, 2022:

I would rename the argument to PILEUP_SNP_DETECTION_RATIO (likewise for the INDEL one below). Also, in the doc, make it clear that this is regarding pileup mode.

Contributor:

It would be good to note in the doc= of all these arguments that they pertain to the PILEUP detection only; otherwise the only hint to that is in the name of the argument itself.

Collaborator:

I've changed all of the arguments to have a "Pileup Detection: " prefix in their doc strings. Hopefully that will make it a little bit clearer? I've hidden these arguments anyway, so maybe it's all moot...
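For illustration, one of the revised declarations might look roughly like this (the @Hidden annotation, field name, default, and doc wording are assumptions based on the reply above, not the merged code):

@Hidden
@Advanced
@Argument(fullName = PILEUP_DETECTION_SNP_THRESHOLD,
        doc = "Pileup Detection: fraction of alt-supporting reads required to consider an alt SNP",
        optional = true)
public double snpThreshold = 0.1;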

* that are visible in the pileups but might be dropped from assembly for any number of reasons. The basic algorithm works
* as follows:
* - iterate over every pileup and count alt bases
* - (unfinished) detect insertions overlapping this site (CURRENTLY ONLY WORKS FOR INSERTIONS)
Contributor:

It's unclear what is unfinished... do you want this to work for all indels, but it only works for insertions?

Collaborator:

I consider the implementation of indels to be relatively untested compared to the SNP code. I'm going to change this to beta.
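For reference, the SNP half of the algorithm described above boils down to something like this self-contained sketch (the names and structure are assumptions; the thresholds follow the snippets above):

import java.util.HashMap;
import java.util.Map;

public final class PileupSnpSketch {
    /**
     * Counts alt bases in one pileup column and reports whether the most common alt
     * clears the 10%-of-reads and at-least-5-reads thresholds used above.
     */
    public static boolean hasCandidateSnp(final byte refBase, final byte[] pileupBases) {
        final Map<Byte, Integer> altCounts = new HashMap<>();
        for (final byte base : pileupBases) {
            if (base != refBase) {
                altCounts.merge(base, 1, Integer::sum);
            }
        }
        final int numOfBases = pileupBases.length;
        final int maxAlt = altCounts.values().stream().mapToInt(Integer::intValue).max().orElse(0);
        return numOfBases >= 5 && (double) maxAlt / numOfBases > 0.10;
    }
}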

if (args.badReadProperPair && !read.isProperlyPaired()) {
return true;
}
if (args.badReadSecondary && (read.isSecondaryAlignment() || read.hasAttribute("SA"))) {
Contributor:

In the doc for badReadSecondary there's no mention of secondary alignments... and yet, here it is. Please make sure we are doing it right...

Collaborator:

Good catch. The DRAGEN heuristic attempts to capture both (on the assumption that those messy reads are going to cause false positives). I'll update the documentation.
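For illustration, the clarified behavior could be documented right at the check, roughly like this (the comment wording is a sketch, not the merged text):

// badReadSecondary: filter reads that are secondary alignments OR that carry an "SA"
// (supplementary alignment) tag; the DRAGEN heuristic treats both kinds of messy
// mappings as likely sources of false-positive support.
if (args.badReadSecondary && (read.isSecondaryAlignment() || read.hasAttribute("SA"))) {
    return true;
}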

}


//TODO this conversion is really unnecessary...
Contributor:

huh? so why is it done?

Collaborator:

It's done because the method SequenceUtil.calculateSamNmTag() works on SAMRecords and it's in HTSJDK, so I can't easily change it... I didn't want to reimplement it, but as you can see I have been forced to thread the header down to this class as a result.
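For illustration, the conversion being described presumably looks roughly like this (the variable names and the exact overload chosen are assumptions):

// GATKRead -> SAMRecord so the HTSJDK utility can be reused; this is why the header
// has to be threaded down to this class.
final SAMRecord samRecordForRead = read.convertToSAMRecord(header);
final int nm = SequenceUtil.calculateSamNmTag(samRecordForRead, referenceBases);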

}

//TODO add threshold described by Illumina about insert size compared to the average
if (args.templateLenghtStd > 0 && args.templateLengthMean > 0) {
Contributor:

Insert size is a much better term than template length... I think we should use that instead.

Collaborator:

We are already using samRecordForRead.getInferredInsertSize(). Or do you mean we should change the title/arguments for this to use "insert size" as the name?
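For illustration, the insert-size filter under discussion is essentially a deviation-from-the-mean check, roughly like this (the multiplier and local variable handling are assumptions; the field names follow the snippet above):

if (args.templateLenghtStd > 0 && args.templateLengthMean > 0) {
    // Flag reads whose inferred insert size sits far from the configured mean,
    // measured in units of the configured standard deviation.
    final int insertSize = Math.abs(samRecordForRead.getInferredInsertSize());
    if (Math.abs(insertSize - args.templateLengthMean) > 3 * args.templateLenghtStd) {
        return true; // treat as a "bad" read for pileup detection
    }
}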

20 10098786 . C T 32.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.431;DP=9;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=54.65;MQRankSum=-2.200;QD=3.63;ReadPosRankSum=-1.383;SOR=0.132 GT:AD:DP:GQ:PL 0/1:7,2:9:40:40,0,217
20 10099220 . A G 598.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=-1.664;DP=47;ExcessHet=3.0103;FS=1.137;MLEAC=1;MLEAF=0.500;MQ=57.74;MQRankSum=-2.596;QD=13.30;ReadPosRankSum=0.629;SOR=0.519 GT:AD:DP:GQ:PL 0/1:25,20:45:99:606,0,873
20 10099535 . G A 1356.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=5.034;DP=68;ExcessHet=3.0103;FS=2.172;MLEAC=1;MLEAF=0.500;MQ=59.04;MQRankSum=-1.418;QD=20.87;ReadPosRankSum=-0.777;SOR=0.408 GT:AD:DP:GQ:PL 0/1:26,39:65:99:1364,0,686
20 10099565 . C T 1367.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=4.595;DP=70;ExcessHet=3.0103;FS=8.736;MLEAC=1;MLEAF=0.500;MQ=59.40;MQRankSum=-1.266;QD=19.82;ReadPosRankSum=1.225;SOR=0.191 GT:AD:DP:GQ:PL 0/1:31,38:69:99:1375,0,994
Contributor:

There's a utility method that indexes VCFs for testing; perhaps it's better to use it than to include indexes in the repository?

@jamesemery (Collaborator) commented on Jan 11, 2022:

Looking at that folder, it looks like every file also has its index checked in (with a few exceptions that don't appear to be used anywhere and should probably be removed themselves). I'm going to call it out of scope for this PR to change the HC tests over to indexing on the fly.

@yfarjoun (Contributor) left a comment:

Done with this pass. Let me know that you see my comments... they were only put on your last commit.

@jamesemery (Collaborator):

@yfarjoun Rebased (it was a bit messy; it's quite possible I broke something) and back to you.

@yfarjoun (Contributor) left a comment:

I still see many unresolved reviewer comments from previous passes, but the tests that you added are good.

@jamesemery (Collaborator):

@yfarjoun I think I have responded to everything in there and gotten rid of a bunch of the TODOs that were left (it turns out a bunch of them were vestigial from development). There is really only one of my comments I was unsure was worth fixing, and I will probably ask around the engine team for opinions about it.

@yfarjoun (Contributor) left a comment:

A few small nits remain. Thanks for moving this along!

final Haplotype hapD = new Haplotype("ACCTGAA".getBytes());
final Haplotype hapF = new Haplotype("GAAGAAG".getBytes()); // testing repeated kmers

Map<Kmer, Integer> flatSupportAllKmers = new HashMap<Kmer, Integer>() { private static final long serialVersionUID = 0L; {
Contributor:

Please put the serialVersionUID onto a line of its own (here and below).

Collaborator:

Sure... I'm still annoyed that I have to do this, but HashMap is serializable, so this anonymous subclass generates a compiler warning otherwise...
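For illustration, the formatting being requested is just this (the entries are elided; only the layout changes):

Map<Kmer, Integer> flatSupportAllKmers = new HashMap<Kmer, Integer>() {
    private static final long serialVersionUID = 0L;
    {
        // ... same entries as in the original initializer ...
    }
};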

@@ -771,7 +771,7 @@ private boolean containsCalls(final CalledHaplotypes calledHaplotypes) {
//TODO - why the activeRegion cannot manage its own one-time finalization and filtering?
//TODO - perhaps we can remove the last parameter of this method and the three lines bellow?
Contributor:

Are these TODOs still relevant and correct?

Collaborator:

No, but they are old enough now that maybe they are load-bearing? There is a bunch of this commentary left over from the first port of HaplotypeCaller back in the day that can probably go...

@jamesemery force-pushed the bg_pileup_detection branch from addef19 to ec287d8 on March 29, 2022 18:59
@jamesemery (Collaborator):

@yfarjoun That was a messy rebase. Assuming these tests pass, everything should be in order for you to take another look.

@gatk-bot:

Travis reported job failures from build 38310
Failures in the following jobs:

Test Type JDK Job ID Logs
variantcalling openjdk8 38310.4 logs

@yfarjoun (Contributor) left a comment:

Just a small nit in one of the comments... thanks!

@jamesemery marked this pull request as ready for review on April 11, 2022 14:09
@jamesemery dismissed their stale review on April 11, 2022 14:24

It's my own review, and I have made the necessary changes.

@jamesemery merged commit 9e04333 into master on Apr 14, 2022
@jamesemery deleted the bg_pileup_detection branch on April 14, 2022 14:19
jamesemery pushed a commit to jamesemery/gatk that referenced this pull request on Apr 14, 2022:
…ore assembly that supplements the assembly variants with variants that show up in the pileups. (broadinstitute#7432)