Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add bamout read annotations for alignement and callable regions, and supported genotypes #5215

Merged
merged 11 commits into from
Oct 12, 2018

Conversation

kachulis
Copy link
Contributor

Adds three annotations to reads in bamout produced by HaplotypeCaller and Mutect2:

  1. AR (Alignment Region)- The reference region over which a read is aligned to a haplotype
  2. CR (Callable Region)- The reference region over which the caller is using this read to call variants
  3. SG (Supported Genotypes)- A list of the genotypes supported by this read for variants called in the callable region

Copy link
Contributor

@davidbenjamin davidbenjamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small comments

@@ -35,6 +36,9 @@
public final class AssemblyBasedCallerUtils {

static final int REFERENCE_PADDING_FOR_ASSEMBLY = 500;
static final String SUPPORTED_GENOTYPES_TAG="SG";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

public

* @param likelihoodsHaplotype ReadLikelihoods containing reads to be annotated along with haplotypes to which these reads have been aligned
* @param callableRegion ref region over which caller is using these ReadLikelihoods to call variants
*/
public static void annotateReadLikelihoodsWithRegions(final ReadLikelihoods<Haplotype> likelihoodsHaplotype,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

likelihoodsHaplotype should just be likelihoods, IMO

}

/**
* Annotates reads in ReadLiklihoods with genotype supported by each read a particular variant. If a read already has a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"a particular variant" seems out of place

public static void annotateReadLikelihoodsWithSupportedGenotypes(final VariantContext vc,
final ReadLikelihoods<Allele> likelihoodsAllele) {
//assign supported Genotypes to each read
final Set<Allele> alleles = new LinkedHashSet<>(vc.getAlleles());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inline this into next line?

//assign supported Genotypes to each read
final Set<Allele> alleles = new LinkedHashSet<>(vc.getAlleles());
final Map<Allele, List<Allele>> alleleSubset = alleles.stream().collect(Collectors.toMap(a -> a, Arrays::asList));
final ReadLikelihoods<Allele> subsettedLikelihoodsAllele = likelihoodsAllele.marginalize(alleleSubset);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just call this subsettedLikelihoods

@@ -130,6 +132,11 @@ public CalledHaplotypes assignGenotypeLikelihoods(final List<Haplotype> haplotyp
final int ploidy = configuration.genotypeArgs.samplePloidy;
final List<Allele> noCallAlleles = GATKVariantContextUtils.noCallAlleles(ploidy);

if (withBamOut) {
////add annotations to reads for alignment regions and calling regions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just two slashes

@@ -181,6 +188,11 @@ public CalledHaplotypes assignGenotypeLikelihoods(final List<Haplotype> haplotyp
final VariantContext annotatedCall = makeAnnotatedCall(ref, refLoc, tracker, header, mergedVC, readAlleleLikelihoods, call);
returnCalls.add( annotatedCall );

if (withBamOut) {
//add annotations to reads for which genotypes are supported by each read
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is not necessary because the method name is self-documenting

@@ -103,6 +105,11 @@ public CalledHaplotypes callMutations(
final Set<Haplotype> calledHaplotypes = new HashSet<>();
final List<VariantContext> returnCalls = new ArrayList<>();

if(withBamOut){
////add annotations to reads for alignment regions and calling regions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just two slashes

@@ -172,6 +179,10 @@ public CalledHaplotypes callMutations(
final ReadLikelihoods<Allele> trimmedLikelihoods = log10Likelihoods.marginalize(trimmedToUntrimmedAlleleMap);

final VariantContext annotatedCall = annotationEngine.annotateContext(trimmedCall, featureContext, referenceContext, trimmedLikelihoods, a -> true);
if(withBamOut) {
//add annotations to reads for which genotypes are supported by each read
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is not necessary

final int start = 13763;
final SimpleInterval loc = new SimpleInterval(contig, start, start + hap1String.length());
final SimpleInterval loc2 = new SimpleInterval(contig, loc.getStart() + 4, loc.getEnd() - 4);
List<Integer> snpIndecies = Arrays.asList(null, 12, 23);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indices

@codecov-io
Copy link

codecov-io commented Sep 28, 2018

Codecov Report

Merging #5215 into master will increase coverage by 0.01%.
The diff coverage is 94.91%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #5215      +/-   ##
============================================
+ Coverage     86.76%   86.77%   +0.01%     
- Complexity    29944    30000      +56     
============================================
  Files          1838     1838              
  Lines        138820   139051     +231     
  Branches      15285    15329      +44     
============================================
+ Hits         120448   120667     +219     
- Misses        12804    12810       +6     
- Partials       5568     5574       +6
Impacted Files Coverage Δ Complexity Δ
...walkers/haplotypecaller/HaplotypeCallerEngine.java 78.12% <100%> (+0.15%) 74 <0> (ø) ⬇️
...hellbender/tools/walkers/mutect/Mutect2Engine.java 91.08% <100%> (ø) 59 <0> (ø) ⬇️
...kers/haplotypecaller/AssemblyBasedCallerUtils.java 76.42% <100%> (+5.71%) 32 <6> (+6) ⬆️
...plotypecaller/HaplotypeCallerGenotypingEngine.java 79.87% <80%> (ø) 39 <0> (+2) ⬆️
.../tools/walkers/mutect/SomaticGenotypingEngine.java 89.67% <80%> (-0.28%) 73 <0> (+2)
...hellbender/utils/haplotype/HaplotypeBAMWriter.java 96.87% <85.71%> (-1.46%) 15 <1> (+1)
...lotypecaller/AssemblyBasedCallerUtilsUnitTest.java 95.77% <95.28%> (-4.23%) 45 <43> (+43)
...utils/smithwaterman/SmithWatermanIntelAligner.java 50% <0%> (-30%) 1% <0%> (-2%)
...ithwaterman/SmithWatermanIntelAlignerUnitTest.java 60% <0%> (ø) 2% <0%> (ø) ⬇️
...te/hellbender/utils/genotyper/ReadLikelihoods.java 90.14% <0%> (+0.4%) 143% <0%> (ø) ⬇️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8103bde...7d53fb9. Read the comment docs.

for (ReadLikelihoods<Allele>.BestAllele bestAllele : bestAlleles) {
String attribute = "";
GATKRead read = bestAllele.read;
Allele allele = bestAllele.allele;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not obvious to me why these are supported genotypes. What is the value of the tag supposed to be?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're right. The value of the tag is really the alleles of called variants that each read supports. For example, if a read overlaps two called variants in the active region at 1:10000 and 1:10010, and supports the reference allele at the first and the first alt allele at the second, the annotation will be 1:10000=0, 1:10010=1. Essentially this is the same information encoded by allele depth in a vcf. So I think it is more correct to call these supported alleles as opposed to supported genotypes.

}

/**
* Annotates reads in ReadLiklihoods with genotype supported by each read a particular variant. If a read already has a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Annotates reads in ReadLikelihoods with genotype supported by each read at a particular variant." ?

@kachulis
Copy link
Contributor Author

kachulis commented Oct 4, 2018

back to you when you get a chance to take another look, @davidbenjamin and @ldgauthier

@davidbenjamin davidbenjamin removed their assignment Oct 11, 2018
@kachulis kachulis force-pushed the ck_bamout_annotate_reads_active_regions branch from 93c7a9e to 7d53fb9 Compare October 12, 2018 17:20
@kachulis kachulis merged commit 17d17fd into master Oct 12, 2018
@kachulis kachulis deleted the ck_bamout_annotate_reads_active_regions branch October 12, 2018 19:32
EdwardDixon pushed a commit to EdwardDixon/gatk that referenced this pull request Nov 9, 2018
…upported alleles (broadinstitute#5215)

Add bamout read annotations for alignment and callable regions, and supported alleles (broadinstitute#5215)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants