LocusDepthtoBAF tool #7776

tedsharpe · 2022-04-12T20:51:15Z

No description provided.

gatk-bot · 2022-04-12T21:35:47Z

Travis reported job failures from build 38788
Failures in the following jobs:

Test Type	JDK	Job ID	Logs
integration	openjdk11	38788.12	logs

mwalker174

Thank you @tedsharpe this is going to be a major QOL improvement for users, who sometimes spend weeks wrangling SNP calls from various sources.

Mostly minor comments below. How much larger at the LD files after adding the ref / alt indices? These files can live in nearline storage for years, so minimizing their size is important, even if it means slowing down LD -> BAF conversion a bit. It probably compresses out well though.

On a related note, sometimes projects will mix/match samples. It may happen that someone would try to combine LD files generated with different site lists. This is not an issue with PE/SR which are collected genome-wide nor for counts where we enforce a check on the bin intervals. Is it possible to add something to the LD header to ensure consistency? Maybe a hash on the sites list would be sufficient? Or better yet an explicit check that the input LD sites are consistent across all samples? I believe all biallelic snp sites are emitted in CollectSVEvidence.

Edit: we also need to check this against the current pipeline, ie using HaplotypeCaller to call SNPs. I suspect the chi squared model will be good enough, but we should do a head-to-head on at least one sample before merging this.

mwalker174 · 2022-04-20T13:26:20Z

src/main/java/org/broadinstitute/hellbender/tools/sv/BafEvidence.java

@@ -24,6 +24,14 @@ public BafEvidence( final String sample, final String contig,
        this.value = value;
    }

+    // value-altering constructor
+    public BafEvidence( final BafEvidence that, final double value ) {


Can you please add a quick unit test for this?

mwalker174 · 2022-04-20T13:28:02Z

src/main/java/org/broadinstitute/hellbender/tools/sv/LocusDepth.java

@@ -11,25 +12,36 @@ public final class LocusDepth implements SVFeature {
    private final String contig;
    private final int position;
    private final String sample;
-    private final byte refCall; // index into nucleotideValues
+    // next three fields index "ACGT"


Can you create an enum type for this?

mwalker174 · 2022-04-20T13:33:24Z

src/main/java/org/broadinstitute/hellbender/tools/sv/LocusDepth.java

    public final static String BCI_VERSION = "1.0";

-    public LocusDepth( final Locatable loc, final String sample, final byte refCall ) {
+    public LocusDepth( final Locatable loc, final String sample, final int refIndex, final int altIndex ) {
+        Utils.validateArg(refIndex >= 0 && refIndex <= 3, "refIndex must be between 0 and 3");


How do you feel about Utils.nonNull() as well for loc and sample? I tend to use them where I can in constructors for clarity. You can even combine it with the initialization like this.sample = Utils.nonNull(sample);

mwalker174 · 2022-04-20T13:36:59Z

src/main/java/org/broadinstitute/hellbender/tools/sv/LocusDepthtoBAF.java

+
+/**
+ * <p>Merges locus-sorted LocusDepth evidence files, and calculates the bi-allelic frequency for each
+ * sample and site, and writes these values as a BafEvidence output file.</p>


Add a description of the filtering methodology

mwalker174 · 2022-04-20T15:20:55Z

src/main/java/org/broadinstitute/hellbender/tools/sv/LocusDepthtoBAF.java

+        final double refDiff = locusDepth.getRefDepth() - expectRefAlt;
+        final double altDiff = locusDepth.getAltDepth() - expectRefAlt;
+        double chiSq = (refDiff * refDiff + altDiff * altDiff) / expectRefAlt;
+        double fitProb = 1. - chiSqDist.cumulativeProbability(chiSq);


Add comment that this is a Pearson chi-squared test

Suggested change

double fitProb = 1. - chiSqDist.cumulativeProbability(chiSq);

final double fitProb = 1. - chiSqDist.cumulativeProbability(chiSq);

mwalker174 · 2022-04-20T15:22:55Z

src/main/java/org/broadinstitute/hellbender/tools/sv/LocusDepthtoBAF.java

+        if ( fitProb < minHetLikelihood ) {
+            return 0.;
+        }
+        return (double)locusDepth.getAltDepth() / totalDepth;


Suggested change

return (double)locusDepth.getAltDepth() / totalDepth;

return locusDepth.getAltDepth() / (double) totalDepth;

I can never remember if casts get applied before division

mwalker174 · 2022-04-20T17:25:22Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/sv/CollectSVEvidence.java

@@ -694,14 +695,16 @@ private boolean readNextLocus() {
                return false;
            }
            VariantContext snp = snpSourceItr.next();
-            while ( !snp.isSNP() ) {
+            while ( !snp.isSNP() || !snp.isBiallelic() ) {


Let's future-proof this and allow multiallelic by default, but have a command line flag to restrict only to biallelic

mwalker174 · 2022-04-20T17:28:23Z

src/test/resources/org/broadinstitute/hellbender/tools/walkers/sv/LocusDepthtoBAF/test2.ld.txt

+1	1000	smpl1	0	3	4	0	0	6
+1	1000	smpl2	0	3	5	0	0	5
+1	1000	smpl3	0	3	6	0	0	4


Can you add a few more records at other loci, including some on another contig?

mwalker174 · 2022-04-20T17:49:31Z

src/main/java/org/broadinstitute/hellbender/tools/sv/LocusDepthtoBAF.java

+            shortName = StandardArgumentDefinitions.FEATURE_SHORT_NAME )
+    private List<FeatureInput<LocusDepth>> locusDepthFiles;
+
+    @Argument(fullName = SAMPLE_NAMES_NAME, doc = "Sample names", optional = true)


Can you elaborate on what this does in the doc? Is it used for subsetting?

mwalker174 · 2022-04-20T17:51:30Z

src/main/java/org/broadinstitute/hellbender/tools/sv/LocusDepthtoBAF.java

+    @Argument(
+            doc = "Minimum likelihood of site being biallelic and heterozygous",
+            fullName = MIN_HET_LIKELIHOOD, optional = true )
+    private double minHetLikelihood = .5;


I think this should be a probability instead of likelihood

tedsharpe · 2022-04-25T18:23:52Z

I believe that this is ready for a re-review. I think all comments were addressed.
In particular, we should check that the handling of multi-allelic sites is what was intended.

codecov · 2022-04-25T18:38:15Z

Codecov Report

Merging #7776 (964de7a) into master (c6eb337) will increase coverage by 60.933%.
The diff coverage is 83.953%.

@@               Coverage Diff                @@
##              master     #7776        +/-   ##
================================================
+ Coverage     18.644%   79.577%   +60.933%     
- Complexity      4635     34845     +30210     
================================================
  Files           1261      2214       +953     
  Lines          73745    173540     +99795     
  Branches       11768     18736      +6968     
================================================
+ Hits           13749    138098    +124349     
+ Misses         57944     29026     -28918     
- Partials        2052      6416      +4364

Impacted Files	Coverage Δ
...ine/GATKPlugin/GATKAnnotationPluginDescriptor.java	`83.851% <ø> (+46.584%)`	⬆️
...ine/GATKPlugin/GATKReadFilterPluginDescriptor.java	`84.615% <ø> (+32.308%)`	⬆️
...nder/cmdline/PicardCommandLineProgramExecutor.java	`63.636% <0.000%> (+63.636%)`	⬆️
...e/argumentcollections/DbsnpArgumentCollection.java	`100.000% <ø> (ø)`
...llections/MultiVariantInputArgumentCollection.java	`100.000% <ø> (+100.000%)`	⬆️
...broadinstitute/hellbender/engine/FeatureCache.java	`92.157% <ø> (+19.608%)`	⬆️
...oadinstitute/hellbender/engine/FeatureContext.java	`81.818% <ø> (+72.727%)`	⬆️
...roadinstitute/hellbender/engine/ProgressMeter.java	`93.814% <ø> (+7.899%)`	⬆️
...g/broadinstitute/hellbender/engine/ReadWalker.java	`100.000% <ø> (+7.407%)`	⬆️
...tute/hellbender/engine/ReadlessAssemblyRegion.java	`46.667% <ø> (+46.667%)`	⬆️
... and 2324 more

mwalker174 · 2022-04-28T16:23:47Z

src/main/java/org/broadinstitute/hellbender/tools/sv/LocusDepthtoBAF.java

+            for ( final int altIndex : altIndices ) {
+                double baf = calcBAF(ld, refIndex, altIndex);
+                if ( baf != BafEvidence.MISSING_VALUE ) {
+                    beList.add(new BafEvidence(ld.getSample(), ld.getContig(), ld.getStart(), baf));
+                    break;
+                }
+            }


In some cases this will lead to bias. I didn't think through some of the problems multi-allelics are going to cause us with compatibility here. Apologies for flip-flopping on this, but let's stick to bi-allelic for both LD collection and BAF generation. Regardless, it seems like this implementation would be problematic in some edge cases. I think we will need to either provide the sites list to this tool, or go back to storing the ref/alt alleles in LD (which would be fine, probably simpler and faster).

I'll remove the multi-allelic option, which kind of weirded me out anyway.
However, I'm confused by the last part of your comment: we do provide the sites list to this tool. That's where I'm getting the ref/alt call info, since it's no longer in the LD records. (See lines 210-223.)

Sorry just realized I missed that, all good to just rip out the multi-allelics.

tedsharpe added 2 commits April 12, 2022 16:49

new LocusDepthtoBAF tool

8d74d04

tests

1068211

mwalker174 requested changes Apr 20, 2022

View reviewed changes

address reviewer comments

2d9e9a2

mwalker174 reviewed Apr 28, 2022

View reviewed changes

prohibit multi-allelic sites

964de7a

mwalker174 approved these changes Apr 28, 2022

View reviewed changes

tedsharpe merged commit 9ae1fd8 into master Apr 28, 2022

tedsharpe deleted the tws_LDtoBAF branch April 28, 2022 20:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

LocusDepthtoBAF tool #7776

LocusDepthtoBAF tool #7776

tedsharpe commented Apr 12, 2022

gatk-bot commented Apr 12, 2022

mwalker174 left a comment •

edited

Loading

mwalker174 Apr 20, 2022

mwalker174 Apr 20, 2022

mwalker174 Apr 20, 2022

mwalker174 Apr 20, 2022

mwalker174 Apr 20, 2022

mwalker174 Apr 20, 2022

mwalker174 Apr 20, 2022

mwalker174 Apr 20, 2022

mwalker174 Apr 20, 2022

mwalker174 Apr 20, 2022

mwalker174 Apr 20, 2022

tedsharpe commented Apr 25, 2022

codecov bot commented Apr 25, 2022 •

edited

Loading

mwalker174 Apr 28, 2022

tedsharpe Apr 28, 2022

mwalker174 Apr 28, 2022

	double fitProb = 1. - chiSqDist.cumulativeProbability(chiSq);
	final double fitProb = 1. - chiSqDist.cumulativeProbability(chiSq);

	return (double)locusDepth.getAltDepth() / totalDepth;
	return locusDepth.getAltDepth() / (double) totalDepth;

LocusDepthtoBAF tool #7776

LocusDepthtoBAF tool #7776

Conversation

tedsharpe commented Apr 12, 2022

gatk-bot commented Apr 12, 2022

mwalker174 left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tedsharpe commented Apr 25, 2022

codecov bot commented Apr 25, 2022 • edited Loading

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mwalker174 left a comment •

edited

Loading

codecov bot commented Apr 25, 2022 •

edited

Loading